Welcome to the third installment of a five part tutorial series on the recordlinkage python package. Each tutorial will cover a specific stage of the data integration workflow. The topics and links for each tutorial are included below:
- Pre-processing with recordlinkage
- Indexing candidate links with recordlinkage
- Record comparison with recordlinkage
- Record pair classification with recordlinkage
- Data fusion (coming soon …)
Goal
By the end of this tutorial you should be comfortable using recordlinkage’s built-in methods and custom functions for record comparison.
Record comparison is an important part of the data integration workflow, as it provides the basis for classifying matches and the eventual fusion of data. Both of these steps will be covered in subsequent tutorials.
In this tutorial, we will compare two data sets. The first includes bibliographic data, which was retrieved using the metaknowledge python package. This data includes information on affiliations, papers, authors, etc. The second dataset is a collection of institutions from GRID. Both data sets have been filtered to only include institutions from Canada. Five rows from each of these datasets have been included below.
Affiliation data
Affiliation | Country | |
---|---|---|
7 | university of alberta 2 51 south academic building edmonton ab canada t6g2g7 | Canada |
1 | departement de microbiologie infectiologie et immunologieuniversite de montreal montreal quebec h3c 3j7 canada electronic addressroxannecollinumontrealca | Canada |
10 | getting to know cancer room 229a 36 arthur st truro nova scotia b2n 1x5canada | Canada |
17 | michael smith laboratories university of british columbia vancouver britishcolumbia canada | Canada |
13 | gastrointestinal research group and inflammation research network department of physiology and pharmacology calvin joan and phoebe snyder institute for chronicdiseases cumming school of medicine university of calgary calgary albertacanada | Canada |
GRID data
name | city | state | country | |
---|---|---|---|---|
2457 | holstein association of canada | Brantford | Ontario | Canada |
2499 | yellow island aquaculture | Victoria | British Columbia | Canada |
3094 | mitel | Ottawa | Ontario | Canada |
1370 | tekion | Toronto | Ontario | Canada |
1030 | toronto general hospital | Toronto | Ontario | Canada |
Comparing records
Preliminary work
There is a bit of work we need to do before we can compare records in our data sets. First, we need to load and pre-process the data. Then we will need to index the datasets to give us our list of candidate links. Each of these topics have been covered in detailed in previous tutorials (links at the top of the page). The script below contains the code used to prepare the data for comparisons.
Set up the Compare object
In recordlinkage you must initiate a Compare
object prior to performing any comparison functionality between records. This object stores both dataframes, the candidate links, and a vector containing comparison results. Further, the Compare
object contains the methods for performing comparisons. The code block below initializes the comparison object.
Built-in comparison methods
Now that we have initiated a comparison method we can go ahead and start comparing records. The simplest way to do comparisons is to use comparison methods that have been built-in to recordlinkage. Currently there are five specific comparison methods within recordlinkage: Compare.exact()
, Compare.string()
, Compare.numeric()
, Compare.geo()
, and Compare.date()
. These are all well documented in the documentation. For this example we will be comparing the Affiliation
column from the bibliometric dataset and the name
column from the grid dataset.
We will start by using the Compare.exact()
comparison method. This method is simple, if two values are an exact match a comparison score of 1 is returned, otherwise 0 is retured. Since the affiliation data we are trying to match is quite messy, this method returns very low scores. In fact, looking at the describe table included below the code block, you can see that no exact matches were found between the two columns.
Next, we will use the Compare.string()
comparison method. This method is a bit more complicated and generates a score based on well known string-comparison algorithms. For this example, we will use the Levenshtein method. However, as shown in the describe table below, we still aren’t getting very high comparison scores, with the highest only being 0.52 (of a maximum 1.0).
exact | string | |
---|---|---|
count | 87920 | 87920 |
mean | 0 | 0.152443 |
std | 0 | 0.0587379 |
min | 0 | 0 |
25% | 0 | 0.111111 |
50% | 0 | 0.15 |
75% | 0 | 0.19 |
max | 0 | 0.521739 |
Custom comparison methods
Based on the lack of high comparison scores, it doesn’t seem as though recordlinkage’s built-in methods are going to work for our current use case. Thankfully, there is a general comparison method built-in to recordlinkage called Compare.compare()
. This method takes in a comparison function, two columns to compare, and the name of the column the score should be stored in. So, all we have to do is define and call a custom comparison function.
To demonstrate how the Compare.compare()
method works, we will use a couple of custom comparison function that has already been defined in NetLab’s labutils
package. We will use the normed_lcss()
and normed_fuzzy_lcss()
functions. normed_lcss
computes a comparison score based on the length of the longest common substring between two strings. This score is normalized based on the length of the longest possible substring (the lesser of the two string lengths). normed_fuzzy_lcss()
computes a score based on the prescence of similar substrings within the two strings. Once again this is normalized based on the length of the longest possible substring. For more information on this function see the documentation.
The code block below shows how each of these functions can be implemented using the generalized Compare.compare()
method in recordlinkage.
exact | string | lcss | fuzzy_lcss | |
---|---|---|---|---|
count | 87920 | 87920 | 87920 | 87920 |
mean | 0 | 0.152443 | 0.184052 | 0.193961 |
std | 0 | 0.0587379 | 0.0997552 | 0.102741 |
min | 0 | 0 | 0 | 0 |
50% | 0 | 0.15 | 0.163265 | 0.172414 |
99% | 0 | 0.321429 | 0.5 | 0.527778 |
99.99% | 0 | 0.5 | 0.7 | 0.875 |
max | 0 | 0.521739 | 0.9 | 0.9 |
As we can see in the table above, both custom methods are resulting in higher comparison scores, but are still spread out between 0 and 1. The occurence of rare high scores helps to identify candidate links which likely correspond to true matches. The table below shows the results of comparisons on four pairs of records, two having low scores and two having high scores.
Affiliation_l | name_r | exact | string | lcss | fuzzy_lcss | |
---|---|---|---|---|---|---|
(26, 2507) | institute of parasitology mcgill university st anne de bellevue quebec canadah9x 3v9 | steel structures education foundation | 0 | 0.202381 | 0.0810811 | 0.0810811 |
(22, 2225) | transgenic core facility clinical research institute of montreal montreal qc canada | grand river conservation authority | 0 | 0.228916 | 0.0882353 | 0.0882353 |
(7, 8) | university of alberta 2 51 south academic building edmonton ab canada t6g2g7 | ualberta | 0 | 0.105263 | 0.875 | 0.875 |
(17, 11) | michael smith laboratories university of british columbia vancouver britishcolumbia canada | university of british columbia | 0 | 0.333333 | 1 | 1 |
Defining custom comparison methods
Once you know how to use custom comparison methods, the next important step is knowing how to create a custom method. There is additional information in the recordlinkage documentation, but I hope this example can provide some helpful additional information.
The code block below shows a template that can be used for create a custom comparison method. A custom comparison method takes in two Pandas.Series
(columns) as well as any additional arguments that that specific function might need. To use the template below all you need to do is add to the inner_apply()
method, writing the functionality to compare two values and produce a comparison score.
As an example, we will use this template to create a new comparison method. This method might not be terribly useful, but it should demonstrate how this template can be customixed for your own purposes. This method will first tokenize both strings (splitting at spaces). Then it will check that the first token from both strings appears in the other string. If both appear in the other a score of 1 is given. If neither appear a score of 0 is given. Finally, if only one appears in the other, a score of 0.5 is given.
The table below shows the resuls of applying this function to our dataset, showing the affiliation and name being compared as well as the resulting score. While the method itself isn’t very useful, it provides a good demonstration of how to define and use your own custom comparison function.
Affiliation_l | name_r | first_token | |
---|---|---|---|
(17, 1425) | michael smith laboratories university of british columbia vancouver britishcolumbia canada | phenomenome discoveries | 0 |
(14, 2691) | department of surgery the university of british columbia vancouver bc canada | strathcona community hospital | 0 |
(26, 39) | institute of parasitology mcgill university st anne de bellevue quebec canadah9x 3v9 | university of victoria | 0.5 |
(21, 343) | institut de recherches cliniques de montreal montreal qc canada | canada council | 0.5 |
(7, 798) | university of alberta 2 51 south academic building edmonton ab canada t6g2g7 | university of northern british columbia | 1 |
(7, 737) | university of alberta 2 51 south academic building edmonton ab canada t6g2g7 | university of ottawa | 1 |
What's next?
After you have compared attributes between a set of candidate links, you will want to determine which should be considered matches. This is done through classification. For an introduction to classification, check out the next tutorial in this series (coming soon …).
Futher reading
We highly recommend that you check out the recordlinkage documentation section on comparing records. For an in-depth look at each comparison method, take a look at the comparing page of the recordlinkage API Reference.
For more information on types of comparison metrics, we suggest reading Chapter 5 of Peter Christen’s book Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection Paperback. A shorter and more general overview of the process is described in Section 3.5 in Foster et al.’s book Big Data and Social Science: A Practical Guide to Methods and Tools.