Record comparison with recordlinkage

Tuesday, July 18, 2017
by Jillian Anderson

Welcome to the third installment of a five-part tutorial series on the recordlinkage Python package. Each tutorial covers a specific stage of the data integration workflow. The topics and links for each tutorial are included below:

  1. Pre-processing with recordlinkage
  2. Indexing candidate links with recordlinkage
  3. Record comparison with recordlinkage
  4. Record pair classification with recordlinkage
  5. Data fusion (coming soon …)

Goal

By the end of this tutorial you should be comfortable using recordlinkage’s built-in methods and custom functions for record comparison.

Record comparison is an important part of the data integration workflow, as it provides the basis for classifying matches and the eventual fusion of data. Both of these steps will be covered in subsequent tutorials.

In this tutorial, we will compare two datasets. The first contains bibliographic data retrieved using the metaknowledge Python package, including information on affiliations, papers, authors, etc. The second is a collection of institutions from GRID. Both datasets have been filtered to only include institutions from Canada. Five rows from each dataset are shown below.

Affiliation data

   Affiliation | Country
7  university of alberta 2 51 south academic building edmonton ab canada t6g2g7 | Canada
1  departement de microbiologie infectiologie et immunologieuniversite de montreal montreal quebec h3c 3j7 canada electronic addressroxannecollinumontrealca | Canada
10 getting to know cancer room 229a 36 arthur st truro nova scotia b2n 1x5canada | Canada
17 michael smith laboratories university of british columbia vancouver britishcolumbia canada | Canada
13 gastrointestinal research group and inflammation research network department of physiology and pharmacology calvin joan and phoebe snyder institute for chronicdiseases cumming school of medicine university of calgary calgary albertacanada | Canada

GRID data

     name | city | state | country
2457 holstein association of canada | Brantford | Ontario | Canada
2499 yellow island aquaculture | Victoria | British Columbia | Canada
3094 mitel | Ottawa | Ontario | Canada
1370 tekion | Toronto | Ontario | Canada
1030 toronto general hospital | Toronto | Ontario | Canada

Comparing records

Preliminary work

There is a bit of work we need to do before we can compare records in our datasets. First, we need to load and pre-process the data. Then we need to index the datasets to get our list of candidate links. Each of these topics has been covered in detail in a previous tutorial (links at the top of the page). The script below contains the code used to prepare the data for comparisons.

import recordlinkage as rl
import pandas as pd
from recordlinkage.standardise import clean # provides clean(); found in recordlinkage.preprocessing in newer releases

# ***************************
# Load Bib Data
# ***************************
bib = pd.read_csv("/path/to/canada_bib.csv") # Read bibliometric data
bib["Affiliation"] = clean(bib["Affiliation"]) # Clean Affiliation Data

# ***************************
# Load GRID Data
# ***************************
grid = pd.read_csv("/path/to/grid.csv") # Read GRID data
grid["name"] = clean(grid["name"]) # Clean GRID data

# ***************************
# Index with Full Index
# ***************************
indexer = rl.FullIndex()
candidate_links = indexer.index(bib, grid)

Set up the Compare object

In recordlinkage you must initialize a Compare object before performing any comparisons between records. This object stores both dataframes, the candidate links, and the vectors of comparison results, and it provides the methods for performing comparisons. The code block below initializes the Compare object.

compare = rl.Compare(candidate_links, bib, grid)

Built-in comparison methods

Now that we have initialized a Compare object, we can start comparing records. The simplest way to do comparisons is to use the comparison methods built into recordlinkage. Currently there are five specific comparison methods: Compare.exact(), Compare.string(), Compare.numeric(), Compare.geo(), and Compare.date(), all of which are covered in the documentation. For this example we will compare the Affiliation column from the bibliometric dataset with the name column from the GRID dataset.

We will start with the Compare.exact() comparison method. This method is simple: if two values are an exact match, a comparison score of 1 is returned; otherwise, 0 is returned. Since the affiliation data we are trying to match is quite messy, this method performs poorly here. In fact, as the describe table below the code block shows, no exact matches were found between the two columns.

Next, we will use the Compare.string() comparison method. This method is a bit more involved: it generates a score using well-known string-comparison algorithms. For this example, we will use the Levenshtein method. However, as the describe table below shows, we still aren’t getting very high comparison scores; the highest is only 0.52 (out of a maximum of 1.0).

# Use built-in comparison functions
compare.exact("Affiliation", "name", name="exact")
compare.string("Affiliation", "name", name="string")

# Print description
print(compare.vectors.describe())

  exact string
count 87920 87920
mean 0 0.152443
std 0 0.0587379
min 0 0
25% 0 0.111111
50% 0 0.15
75% 0 0.19
max 0 0.521739

Custom comparison methods

Based on the lack of high comparison scores, it doesn’t seem as though recordlinkage’s built-in methods will work for our current use case. Thankfully, recordlinkage also provides a general comparison method, Compare.compare(). This method takes a comparison function, the two columns to compare, and the name of the column the scores should be stored in. So, all we have to do is define and pass in a custom comparison function.

To demonstrate how the Compare.compare() method works, we will use a couple of custom comparison functions that have already been defined in NetLab’s labutils package: normed_lcss() and normed_fuzzy_lcss(). normed_lcss() computes a comparison score based on the length of the longest common substring between two strings, normalized by the length of the longest possible common substring (the length of the shorter string). normed_fuzzy_lcss() computes a score based on the presence of similar substrings within the two strings, again normalized by the length of the longest possible common substring. For more information on these functions, see the documentation.
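
To make the idea concrete, the sketch below computes a normalized longest-common-substring score for a single pair of strings. This is a hypothetical illustration of the scoring idea only, not the actual labutils implementation (lcss_score is a made-up name).

# Illustrative only: length of the longest common substring of a and b,
# divided by the length of the shorter string.
def lcss_score(a, b):
    if not a or not b:
        return 0.0
    longest = 0
    prev = [0] * (len(b) + 1)  # common-suffix lengths for the previous row
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1] + 1
                longest = max(longest, curr[j])
        prev = curr
    # Normalize by the longest possible common substring (the shorter string)
    return longest / min(len(a), len(b))

print(lcss_score("university of alberta edmonton", "ualberta"))  # 0.875 ("alberta" over 8 characters)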

The code block below shows how each of these functions can be implemented using the generalized Compare.compare() method in recordlinkage.

# Import custom functions
from labutils import normed_lcss, normed_fuzzy_lcss

# Perform the comparison
compare.compare(normed_lcss, "Affiliation", "name", name="lcss")
compare.compare(normed_fuzzy_lcss, "Affiliation", "name", name="fuzzy_lcss")

# Print new description, including the 99th and 99.99th percentiles
print(compare.vectors.describe([0.99, 0.9999]))

  exact string lcss fuzzy_lcss
count 87920 87920 87920 87920
mean 0 0.152443 0.184052 0.193961
std 0 0.0587379 0.0997552 0.102741
min 0 0 0 0
50% 0 0.15 0.163265 0.172414
99% 0 0.321429 0.5 0.527778
99.99% 0 0.5 0.7 0.875
max 0 0.521739 0.9 0.9

As the table above shows, both custom methods produce higher comparison scores, though the scores are still spread out between 0 and 1. The occurrence of rare high scores helps to identify candidate links that likely correspond to true matches. The table below shows the comparison results for four pairs of records: two with low scores and two with high scores.

  Affiliation_l | name_r | exact | string | lcss | fuzzy_lcss
(26, 2507) institute of parasitology mcgill university st anne de bellevue quebec canadah9x 3v9 | steel structures education foundation | 0 | 0.202381 | 0.0810811 | 0.0810811
(22, 2225) transgenic core facility clinical research institute of montreal montreal qc canada | grand river conservation authority | 0 | 0.228916 | 0.0882353 | 0.0882353
(7, 8) university of alberta 2 51 south academic building edmonton ab canada t6g2g7 | ualberta | 0 | 0.105263 | 0.875 | 0.875
(17, 11) michael smith laboratories university of british columbia vancouver britishcolumbia canada | university of british columbia | 0 | 0.333333 | 1 | 1
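
If you want to pull out high-scoring pairs like these yourself, one simple approach (a sketch, assuming compare.vectors keeps the (bib index, grid index) candidate-link MultiIndex shown in the table above) is to sort the comparison vectors by a score column and look up the corresponding records:

# Sort the comparison vectors by the fuzzy_lcss score and inspect the top pairs
top_pairs = compare.vectors.sort_values("fuzzy_lcss", ascending=False).head(10)

for (bib_idx, grid_idx), scores in top_pairs.iterrows():
    print(bib.loc[bib_idx, "Affiliation"], "|",
          grid.loc[grid_idx, "name"], "|",
          scores["fuzzy_lcss"])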

Defining custom comparison methods

Once you know how to use custom comparison methods, the next important step is knowing how to create one yourself. The recordlinkage documentation covers this, but I hope the following example provides some helpful additional detail.

The code block below shows a template that can be used to create a custom comparison method. A custom comparison method takes in two Pandas.Series (columns) as well as any additional arguments the specific function might need. To use the template, all you need to do is fill in the inner_apply() function, writing the functionality that compares two values and produces a comparison score.

def my_compare(s1, s2):
    # This combines the columns you are comparing into a single DataFrame
    concat = pd.concat([s1, s2], axis=1, ignore_index=True)

    def inner_apply(x):
        """
        This is where your custom algorithm is housed.
        Create a function to be applied to each pair in the DataFrame.
        """
        val1 = x[0]
        val2 = x[1]

        # Do something to produce the result of comparing val1 and val2
        # return the result

    return concat.apply(inner_apply, axis=1)

As an example, we will use this template to create a new comparison method. The method might not be terribly useful, but it should demonstrate how the template can be customized for your own purposes. It first tokenizes both strings (splitting on spaces), then checks whether the first token of each string appears among the other string’s tokens. If both first tokens appear, a score of 1 is given; if neither appears, a score of 0 is given; and if only one appears, a score of 0.5 is given.

def first_token(s1, s2):
    # This combines the columns you are comparing into a single DataFrame
    concat = pd.concat([s1, s2], axis=1, ignore_index=True)

    def apply_first_token(x):
        """
        This is where your custom algorithm is housed.
        Create a function to be applied to each pair in the DataFrame.
        """
        val1 = x[0]
        val2 = x[1]

        # Do something to produce the result of comparing val1 and val2
        tkn1 = val1.split()
        tkn2 = val2.split()

        score = 0
        if tkn1[0] in tkn2:
            score += 0.5

        if tkn2[0] in tkn1:
            score += 0.5

        # return the result
        return score

    return concat.apply(apply_first_token, axis=1)
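
To actually run this comparison, first_token() is passed to the same generalized Compare.compare() method we used earlier; something along these lines:

# Register the custom function as another comparison column
compare.compare(first_token, "Affiliation", "name", name="first_token")

# Summarize the new scores
print(compare.vectors["first_token"].describe())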

The table below shows the results of applying this function to our dataset, including the Affiliation and name being compared as well as the resulting score. While the method itself isn’t very useful, it provides a good demonstration of how to define and use your own custom comparison function.

  Affiliation_l | name_r | first_token
(17, 1425) michael smith laboratories university of british columbia vancouver britishcolumbia canada | phenomenome discoveries | 0
(14, 2691) department of surgery the university of british columbia vancouver bc canada | strathcona community hospital | 0
(26, 39) institute of parasitology mcgill university st anne de bellevue quebec canadah9x 3v9 | university of victoria | 0.5
(21, 343) institut de recherches cliniques de montreal montreal qc canada | canada council | 0.5
(7, 798) university of alberta 2 51 south academic building edmonton ab canada t6g2g7 | university of northern british columbia | 1
(7, 737) university of alberta 2 51 south academic building edmonton ab canada t6g2g7 | university of ottawa | 1

What's next?

After you have compared attributes between a set of candidate links, you will want to determine which should be considered matches. This is done through classification. For an introduction to classification, check out the next tutorial in this series (coming soon …).

Further reading

We highly recommend that you check out the recordlinkage documentation section on comparing records. For an in-depth look at each comparison method, take a look at the comparing page of the recordlinkage API Reference.

For more information on types of comparison metrics, we suggest reading Chapter 5 of Peter Christen’s book Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. A shorter and more general overview of the process is described in Section 3.5 of Foster et al.’s book Big Data and Social Science: A Practical Guide to Methods and Tools.