Fuzzy Matching Addresses to Prevent Fraudulent Applications

Application fraud is fraud committed by submitting a new credit application with fraudulent details to a credit provider. Fraudsters typically collect the personal and financial data of innocent people from identity documents, pay slips, bank statements, and other source documents. The documents may be forged or stolen, or their details may be altered, in order to submit a new credit application. In recent years, the internet has become an appealing place to collect such information. If we keep the addresses from fraudulent applications we have already encountered, we can help prevent further fraud by matching them against the address of each new applicant.
A natural fit for this task is fuzzy matching, which finds correspondences between segments of text even when the match is less than 100% perfect. In our case, the text is the address of a house.

Here is an example of the problem.

# Create an example of addresses
name_a <- c("Aldo", "Andrea", "Alberto", "Antonio", "Angelo")
name_b <- c("Sara", "Serena", "Silvia", "Sonia", "Sissi")
zip_street_a <- c("1204 Roma Street 8", "1204 Roma Street 8", "1204 Roma Street 8", "1204 Venezia street 10", "1204 Venezia Street 110")
zip_street_b <- c("1204 Roma Street 81", "1204 Roma Street 8A", "1204 Roma Street 8B", "1204 Roma Street 8C", "1204 Venezia Street 10C")
db_a <- data.frame(name_a, zip_street_a)
db_b <- data.frame(name_b, zip_street_b)
names(db_a)[names(db_a)=='zip_street_a'] <- 'zipstreet'
names(db_b)[names(db_b)=='zip_street_b'] <- 'zipstreet'


# Use Fuzzy Matching
library(fuzzyjoin)
library(dplyr)

match_data <- stringdist_left_join(db_a, db_b,
              by = "zipstreet", ignore_case = TRUE,
              method = "jaccard",
              max_dist = 1,
              distance_col = "dist"
) %>%
group_by(zipstreet.x)

match_data

# A tibble: 25 x 5
# Groups:   zipstreet.x [3]
   name_a zipstreet.x        name_b zipstreet.y               dist
   <fct>  <fct>              <fct>  <fct>                    <dbl>
 1 Aldo   1204 Roma Street 8 Sara   1204 Roma Street 81     0     
 2 Aldo   1204 Roma Street 8 Serena 1204 Roma Street 8A     0     
 3 Aldo   1204 Roma Street 8 Silvia 1204 Roma Street 8B     0.0714
 4 Aldo   1204 Roma Street 8 Sonia  1204 Roma Street 8C     0.0714
 5 Aldo   1204 Roma Street 8 Sissi  1204 Venezia Street 10C 0.444 
 6 Andrea 1204 Roma Street 8 Sara   1204 Roma Street 81     0     
 7 Andrea 1204 Roma Street 8 Serena 1204 Roma Street 8A     0     
 8 Andrea 1204 Roma Street 8 Silvia 1204 Roma Street 8B     0.0714
 9 Andrea 1204 Roma Street 8 Sonia  1204 Roma Street 8C     0.0714
10 Andrea 1204 Roma Street 8 Sissi  1204 Venezia Street 10C 0.444 
# ... with 15 more rows

The script runs fine, but it assigns the same distance (0) to both of the following address combinations, whereas we would like them to differ:

1204 Roma Street 8 vs. 1204 Roma Street 81
1204 Roma Street 8 vs. 1204 Roma Street 8A

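In reality, Roma Street 81 is far from Roma Street 8, whereas Roma Street 8A is very close to it.
So we need a distance very close to 0 for 8A, and far from 0 for 81. The Jaccard distance cannot tell the two apart here because, with single-character q-grams (the stringdist default, q = 1), it only compares which characters occur in each string, and both the extra "1" and the extra "A" (case is ignored) already occur elsewhere in the address.
In summary: we would like a fuzzy matching that also takes into account the distance between house numbers and combinations of numbers and letters (e.g. 88 is far from 8A but close to 85).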

To solve this, we can store the house number separately from the street and use a function to convert it to a numeric value.

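# Convert a house number with an optional letter suffix to a numeric value
# (expects a single string with at most one upper-case letter, e.g. "8A")
L2n <- function(x) {
  num <- gsub("[^[:digit:]]", "", x)    # digits only, e.g. "8A" -> "8"
  letter <- gsub("[[:digit:]]", "", x)  # remainder, e.g. "8A" -> "A"
  # A = 1, B = 2, ...; each step adds 0.038, so even Z (26 * 0.038 = 0.988) stays below the next whole number
  conv <- ifelse(letter == "", 0, (utf8ToInt(letter) - utf8ToInt("A") + 1L) * 0.038)
  a <- as.double(num) + as.double(conv)
  return(a)
}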

# Create dataset
name_a <- c("Aldo", "Andrea", "Alberto", "Antonio", "Angelo")
name_b <- c("Sara", "Serena", "Silvia", "Sonia", "Sissi")
zip_street_a <- c("1204 Roma Street", "1204 Roma Street", "1204 Roma Street", "1204 Venezia street", "1204 Venezia Street")
zip_street_b <- c("1204 Roma Street", "1204 Roma Street", "1204 Roma Street", "1204 Roma Street", "1204 Venezia Street")
numbers_a <- c("8", "8", "8", "10", "110")
numbers_b <- c("81", "8A", "8B", "8C", "10C")
# db_a <- data.frame(name_a, zip_street_a, numbers_a, stringsAsFactors = F)
# db_b <- data.frame(name_b, zip_street_b, numbers_b, stringsAsFactors = F)
db_a <- data.frame(zip_street_a, numbers_a, stringsAsFactors = F)
db_b <- data.frame(zip_street_b, numbers_b, stringsAsFactors = F)
# colnames(db_a) <- c("name","zipstreet","number")
# colnames(db_b) <- c("name","zipstreet","number")
colnames(db_a) <- c("zipstreet","number")
colnames(db_b) <- c("zipstreet","number")

# Use Fuzzy Matching
match_data <- stringdist_left_join(db_a, db_b,
                                   by = "zipstreet", ignore_case = TRUE,
                                   method = "jaccard",
                                   max_dist = 1,
                                   distance_col = "dist") %>%
                                   mutate(numdist = abs(sapply(number.x, L2n) - sapply(number.y, L2n)))

match_data

           zipstreet.x number.x         zipstreet.y number.y  dist numdist
1     1204 Roma Street        8    1204 Roma Street       81 0.000  73.000
2     1204 Roma Street        8    1204 Roma Street       8A 0.000   0.038
3     1204 Roma Street        8    1204 Roma Street       8B 0.000   0.076
4     1204 Roma Street        8    1204 Roma Street       8C 0.000   0.114
5     1204 Roma Street        8 1204 Venezia Street      10C 0.375   2.114
6     1204 Roma Street        8    1204 Roma Street       81 0.000  73.000
7     1204 Roma Street        8    1204 Roma Street       8A 0.000   0.038
8     1204 Roma Street        8    1204 Roma Street       8B 0.000   0.076
9     1204 Roma Street        8    1204 Roma Street       8C 0.000   0.114
10    1204 Roma Street        8 1204 Venezia Street      10C 0.375   2.114
11    1204 Roma Street        8    1204 Roma Street       81 0.000  73.000
12    1204 Roma Street        8    1204 Roma Street       8A 0.000   0.038
13    1204 Roma Street        8    1204 Roma Street       8B 0.000   0.076
14    1204 Roma Street        8    1204 Roma Street       8C 0.000   0.114
15    1204 Roma Street        8 1204 Venezia Street      10C 0.375   2.114
16 1204 Venezia street       10    1204 Roma Street       81 0.375  71.000
17 1204 Venezia street       10    1204 Roma Street       8A 0.375   1.962
18 1204 Venezia street       10    1204 Roma Street       8B 0.375   1.924
19 1204 Venezia street       10    1204 Roma Street       8C 0.375   1.886
20 1204 Venezia street       10 1204 Venezia Street      10C 0.000   0.114
21 1204 Venezia Street      110    1204 Roma Street       81 0.375  29.000
22 1204 Venezia Street      110    1204 Roma Street       8A 0.375 101.962
23 1204 Venezia Street      110    1204 Roma Street       8B 0.375 101.924
24 1204 Venezia Street      110    1204 Roma Street       8C 0.375 101.886
25 1204 Venezia Street      110 1204 Venezia Street      10C 0.000  99.886

As the table above shows, the numeric distances from 8 to 8A, 8B, and 8C are now all very small, and 8A (0.038) is closer to 8 than 8B (0.076) or 8C (0.114), while 81 is far away (73).
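One way to act on the two distances is to flag a pair only when both the street and the house number are close. The sketch below rebuilds the first five rows of the table by hand so that it runs without any packages. The thresholds max_street_dist and max_number_dist and the column name suspicious are illustrative assumptions, not values taken from the analysis above.

```r
# First five rows of the match table, typed in by hand for illustration
matches <- data.frame(
  number.x = c("8", "8", "8", "8", "8"),
  number.y = c("81", "8A", "8B", "8C", "10C"),
  dist     = c(0, 0, 0, 0, 0.375),
  numdist  = c(73, 0.038, 0.076, 0.114, 2.114),
  stringsAsFactors = FALSE
)

# Hypothetical thresholds; they would need tuning on real data
max_street_dist <- 0.1  # maximum Jaccard distance between the street strings
max_number_dist <- 1    # same house number, differing only by a letter suffix

# Flag a pair only if both the street and the house number are close
matches$suspicious <- matches$dist <= max_street_dist &
  matches$numdist < max_number_dist

matches[matches$suspicious, ]
```

With these thresholds only 8A, 8B, and 8C are flagged as close to Roma Street 8. Number 81 is excluded by numdist and the Venezia Street address by both criteria.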
The function L2n (letter to number) takes the code point of the letter it is passed, subtracts the code point of the letter “A”, and adds one, so that the letters are numbered from 1 rather than 0, following R’s indexing convention. The result is multiplied by 0.038 and added to the numeric part of the house number.
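The same idea can be written so that it accepts a whole vector of house numbers at once and tolerates lower-case suffixes. This is a minimal sketch rather than a replacement for L2n: the name house_number_value and the step argument are hypothetical, and only the first letter of a suffix is considered.

```r
# Vectorised variant of the letter-to-number idea (hypothetical helper)
house_number_value <- function(x, step = 0.038) {
  x <- toupper(x)
  num <- as.numeric(gsub("[^0-9]", "", x))  # numeric part, e.g. "8A" -> 8
  letter <- gsub("[^A-Z]", "", x)           # letter suffix, e.g. "8A" -> "A"
  # Position of the first suffix letter in the alphabet: A = 1, ..., Z = 26;
  # an empty suffix gives 0
  pos <- match(substr(letter, 1, 1), LETTERS, nomatch = 0L)
  num + pos * step
}

values <- house_number_value(c("8", "8A", "8b", "81"))
values
# 8.000  8.038  8.076 81.000

# Distance between number 8 and each of the other house numbers
abs(values[1] - values[-1])
```

Because 26 * 0.038 = 0.988, even a Z suffix keeps the value below the next whole house number.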
To calculate the distance between the street strings we use the Jaccard distance, which is one minus the ratio of shared n-grams to all observed n-grams.
We could also use the Levenshtein distance, which is the minimal number of insertions, deletions, and replacements needed to transform string A into string B.
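To make the two definitions concrete, here is a minimal base-R sketch that computes both distances for the address pairs discussed earlier. The helpers qgrams and jaccard_dist are hypothetical names written for this illustration; they are not the stringdist implementation, and they lower-case the input to mimic ignore_case = TRUE. The Levenshtein distance comes from adist in the utils package.

```r
# Unique q-grams (substrings of length q) of a string, ignoring case
qgrams <- function(s, q = 1) {
  s <- tolower(s)
  n <- nchar(s)
  if (n < q) return(character(0))
  unique(substring(s, 1:(n - q + 1), q:n))
}

# Jaccard distance: 1 - shared q-grams / all observed q-grams
jaccard_dist <- function(a, b, q = 1) {
  qa <- qgrams(a, q)
  qb <- qgrams(b, q)
  1 - length(intersect(qa, qb)) / length(union(qa, qb))
}

base <- "1204 Roma Street 8"
jaccard_dist(base, "1204 Roma Street 81")  # 0: "1" already occurs in "1204"
jaccard_dist(base, "1204 Roma Street 8A")  # 0: "a" already occurs in "roma"
jaccard_dist(base, "1204 Roma Street 8B")  # 1/14, about 0.0714: "b" is new

# Levenshtein distance: one insertion in both cases
as.vector(utils::adist(base, "1204 Roma Street 81"))  # 1
as.vector(utils::adist(base, "1204 Roma Street 8A"))  # 1
```

Neither measure separates 81 from 8A on its own, which is why the house number is compared numerically in a separate column.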