Record comparison with recordlinkage | NetLab

Welcome to the third installment of a five part tutorial series on the recordlinkage python package. Each tutorial will cover a specific stage of the data integration workflow. The topics and links for each tutorial are included below:

Pre-processing with recordlinkage
Indexing candidate links with recordlinkage
Record comparison with recordlinkage
Record pair classification with recordlinkage
Data fusion (coming soon …)

Goal

By the end of this tutorial you should be comfortable using recordlinkage’s built-in methods and custom functions for record comparison.

Record comparison is an important part of the data integration workflow, as it provides the basis for classifying matches and the eventual fusion of data. Both of these steps will be covered in subsequent tutorials.

In this tutorial, we will compare two data sets. The first includes bibliographic data, which was retrieved using the metaknowledge python package. This data includes information on affiliations, papers, authors, etc. The second dataset is a collection of institutions from GRID. Both data sets have been filtered to only include institutions from Canada. Five rows from each of these datasets have been included below.

Affiliation data

	Affiliation	Country
7	university of alberta 2 51 south academic building edmonton ab canada t6g2g7	Canada
1	departement de microbiologie infectiologie et immunologieuniversite de montreal montreal quebec h3c 3j7 canada electronic addressroxannecollinumontrealca	Canada
10	getting to know cancer room 229a 36 arthur st truro nova scotia b2n 1x5canada	Canada
17	michael smith laboratories university of british columbia vancouver britishcolumbia canada	Canada
13	gastrointestinal research group and inflammation research network department of physiology and pharmacology calvin joan and phoebe snyder institute for chronicdiseases cumming school of medicine university of calgary calgary albertacanada	Canada

GRID data

	name	city	state	country
2457	holstein association of canada	Brantford	Ontario	Canada
2499	yellow island aquaculture	Victoria	British Columbia	Canada
3094	mitel	Ottawa	Ontario	Canada
1370	tekion	Toronto	Ontario	Canada
1030	toronto general hospital	Toronto	Ontario	Canada

Comparing records

Preliminary work

There is a bit of work we need to do before we can compare records in our data sets. First, we need to load and pre-process the data. Then we will need to index the datasets to give us our list of candidate links. Each of these topics have been covered in detailed in previous tutorials (links at the top of the page). The script below contains the code used to prepare the data for comparisons.

import recordlinkage as rl import pandas as pd # *************************** # Load Bib Data # *************************** bib = pd.read_csv("/path/to/canada_bib.csv") # Read bibliometric data bib["Affiliation"] = clean(bib["Affiliation"]) # Clean Affiliation Data # *************************** # Load GRID Data # *************************** grid = pd.read_csv("/path/to/grid.csv") # Read GRID data grid["name"] = clean(grid["name"]) # Clean GRID data # *************************** # Index with Full Index # *************************** indexer = rl.FullIndex() candidate_links = indexer.index(bib, grid)

Set up the Compare object

In recordlinkage you must initiate a Compare object prior to performing any comparison functionality between records. This object stores both dataframes, the candidate links, and a vector containing comparison results. Further, the Compare object contains the methods for performing comparisons. The code block below initializes the comparison object.

compare = rl.Compare(candidate_links, bib, grid)

Built-in comparison methods

Now that we have initiated a comparison method we can go ahead and start comparing records. The simplest way to do comparisons is to use comparison methods that have been built-in to recordlinkage. Currently there are five specific comparison methods within recordlinkage: Compare.exact(), Compare.string(), Compare.numeric(), Compare.geo(), and Compare.date(). These are all well documented in the documentation. For this example we will be comparing the Affiliation column from the bibliometric dataset and the name column from the grid dataset.

We will start by using the Compare.exact() comparison method. This method is simple, if two values are an exact match a comparison score of 1 is returned, otherwise 0 is retured. Since the affiliation data we are trying to match is quite messy, this method returns very low scores. In fact, looking at the describe table included below the code block, you can see that no exact matches were found between the two columns.

Next, we will use the Compare.string() comparison method. This method is a bit more complicated and generates a score based on well known string-comparison algorithms. For this example, we will use the Levenshtein method. However, as shown in the describe table below, we still aren’t getting very high comparison scores, with the highest only being 0.52 (of a maximum 1.0).

# Use built-in comparison functions compare.exact("Affiliation", "name", name="exact") compare.string("Affiliation", "name", name="string") # Print description print(compare.vectors.describe())

	exact	string
count	87920	87920
mean	0	0.152443
std	0	0.0587379
min	0	0
25%	0	0.111111
50%	0	0.15
75%	0	0.19
max	0	0.521739

Custom comparison methods

Based on the lack of high comparison scores, it doesn’t seem as though recordlinkage’s built-in methods are going to work for our current use case. Thankfully, there is a general comparison method built-in to recordlinkage called Compare.compare(). This method takes in a comparison function, two columns to compare, and the name of the column the score should be stored in. So, all we have to do is define and call a custom comparison function.

To demonstrate how the Compare.compare() method works, we will use a couple of custom comparison function that has already been defined in NetLab’s labutils package. We will use the normed_lcss() and normed_fuzzy_lcss() functions. normed_lcss computes a comparison score based on the length of the longest common substring between two strings. This score is normalized based on the length of the longest possible substring (the lesser of the two string lengths). normed_fuzzy_lcss() computes a score based on the prescence of similar substrings within the two strings. Once again this is normalized based on the length of the longest possible substring. For more information on this function see the documentation.

The code block below shows how each of these functions can be implemented using the generalized Compare.compare() method in recordlinkage.

# Import custom functions from labutils import normed_lcss, normed_fuzzy_lcss # Perform the comparison compare.compare(normed_lcss, "Affiliation", "name", name="lcss") compare.compare(normed_fuzzy_lcss, "Affiliation", "name", name="fuzzy_lcss") # Print new description, including the 99th and 99.99th percentiles compare.vectors.describe([0.99, 0.9999])

	exact	string	lcss	fuzzy_lcss
count	87920	87920	87920	87920
mean	0	0.152443	0.184052	0.193961
std	0	0.0587379	0.0997552	0.102741
min	0	0	0	0
50%	0	0.15	0.163265	0.172414
99%	0	0.321429	0.5	0.527778
99.99%	0	0.5	0.7	0.875
max	0	0.521739	0.9	0.9

As we can see in the table above, both custom methods are resulting in higher comparison scores, but are still spread out between 0 and 1. The occurence of rare high scores helps to identify candidate links which likely correspond to true matches. The table below shows the results of comparisons on four pairs of records, two having low scores and two having high scores.

	Affiliation_l	name_r	string	lcss	fuzzy_lcss
(26, 2507)	institute of parasitology mcgill university st anne de bellevue quebec canadah9x 3v9	steel structures education foundation	0.202381	0.0810811	0.0810811
(22, 2225)	transgenic core facility clinical research institute of montreal montreal qc canada	grand river conservation authority	0.228916	0.0882353	0.0882353
(7, 8)	university of alberta 2 51 south academic building edmonton ab canada t6g2g7	ualberta	0.105263	0.875	0.875
(17, 11)	michael smith laboratories university of british columbia vancouver britishcolumbia canada	university of british columbia	0.333333	1	1

Defining custom comparison methods

Once you know how to use custom comparison methods, the next important step is knowing how to create a custom method. There is additional information in the recordlinkage documentation, but I hope this example can provide some helpful additional information.

The code block below shows a template that can be used for create a custom comparison method. A custom comparison method takes in two Pandas.Series (columns) as well as any additional arguments that that specific function might need. To use the template below all you need to do is add to the inner_apply() method, writing the functionality to compare two values and produce a comparison score.

def my_compare(s1, s2): # This combines the columns you are comparing into a single DataFrame concat = pd.concat([s1, s2], axis=1, ignore_index=True) def inner_apply(x): """ This is where your custom algorithm is housed. Create a function to be applied to each pair in the DataFrame. """ val1 = x[0] val2 = x[1] # Do something to produce the result of comparing val1 and val2 # return the result return concat.apply(inner_apply, axis=1)

As an example, we will use this template to create a new comparison method. This method might not be terribly useful, but it should demonstrate how this template can be customixed for your own purposes. This method will first tokenize both strings (splitting at spaces). Then it will check that the first token from both strings appears in the other string. If both appear in the other a score of 1 is given. If neither appear a score of 0 is given. Finally, if only one appears in the other, a score of 0.5 is given.

def first_token(s1, s2): # This combines the columns you are comparing into a single DataFrame concat = pd.concat([s1, s2], axis=1, ignore_index=True) def apply_first_token(x): """ This is where your custom algorithm is housed. Create a function to be applied to each pair in the DataFrame. """ val1 = x[0] val2 = x[1] # Do something to produce the result of comparing val1 and val2 tkn1 = val1.split() tkn2 = val2.split() score = 0 if tkn1[0] in tkn2: score += 0.5 if tkn2[0] in tkn1: score += 0.5 # return the result return score return concat.apply(apply_first_token, axis=1)

The table below shows the resuls of applying this function to our dataset, showing the affiliation and name being compared as well as the resulting score. While the method itself isn’t very useful, it provides a good demonstration of how to define and use your own custom comparison function.

	Affiliation_l	name_r	first_token
(17, 1425)	michael smith laboratories university of british columbia vancouver britishcolumbia canada	phenomenome discoveries	0
(14, 2691)	department of surgery the university of british columbia vancouver bc canada	strathcona community hospital	0
(26, 39)	institute of parasitology mcgill university st anne de bellevue quebec canadah9x 3v9	university of victoria	0.5
(21, 343)	institut de recherches cliniques de montreal montreal qc canada	canada council	0.5
(7, 798)	university of alberta 2 51 south academic building edmonton ab canada t6g2g7	university of northern british columbia	1
(7, 737)	university of alberta 2 51 south academic building edmonton ab canada t6g2g7	university of ottawa	1

What's next?

After you have compared attributes between a set of candidate links, you will want to determine which should be considered matches. This is done through classification. For an introduction to classification, check out the next tutorial in this series (coming soon …).

Futher reading

We highly recommend that you check out the recordlinkage documentation section on comparing records. For an in-depth look at each comparison method, take a look at the comparing page of the recordlinkage API Reference.

For more information on types of comparison metrics, we suggest reading Chapter 5 of Peter Christen’s book Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection Paperback. A shorter and more general overview of the process is described in Section 3.5 in Foster et al.’s book Big Data and Social Science: A Practical Guide to Methods and Tools.