Pre-processing with recordlinkage

Wednesday, July 5, 2017
by Jillian Anderson

Welcome to the first installment of a five part tutorial series on the recordlinkage python package. Each tutorial will cover a specific stage of the data integration workflow. The topics and links for each tutorial are included below:

  1. Pre-processing with recordlinkage
  2. Indexing candidate links with recordlinkage
  3. Record comparison with recordlinkage
  4. Record pair classification with recordlinkage
  5. Data fusion (coming soon …)

Data pre-processing

Data pre-processing is a crucial step in any data analysis project, and record linkage problems are no different. The main goal of pre-processing is to transform messy data into something that is usable in an analysis workflow. In some senses, the entire process of record linkage can be considered part of the data pre-processing phase as it is preparing data for analysis. However, the first part of record linkage involves cleaning and standardizing the data. This can be relatively simple, such as removing extra white-space or more complex, such as parsing and standardizing address strings.

While cleaning and standardizing can be performed without the use of specialized packages, the recordlinkage python package has cleaning and standardizing functionality built-in. In this tutorial, we will explore how this functionality can be used during the pre-processing phase of data integration.

Goals

By the end of this tutorial you should have an understanding of how to use the recordlinkage package to perform basic cleaning and standardization of Panda’s DataFrames in python.

To get the most out of the tutorials, I recommend you follow along by writing your own code in the python environment of your choice (script, console, notebook, etc). Writing the code out can help you become familiar with the syntax and functions within recordlinkage.

Before you start

Before you start reading through the tutorial it is important that you have some background knowledge. I recommend you check out the following resources:

  • recordlinkage’s standardise module documentation for an introduction to the functions we will be covering.
  • RegexOne for an introduction (or quick refresher) on regular expressions.

Pre-processing with recordlinkage

recordlinkage’s standardise sub-module includes four built-in functions. Each of these functions tackle a different aspect of pre-processing. Below, we introduce potential uses for each of these functions, provide a basic code example, and present a couple of small problems to solve. For a full overview of the functions, read the documentation. The examples below make use of a small dataset. The table below shows the data in its raw form.

name phone_number occupation address
1. Rachel Green 1(613)555 0149 buyer (fashion) “90 Bedford Street, Apt 20”
2. Ross Geller +1-613-555-0138 paleontogist “100 Grove Street, Apartment 16”
3. Mönica Geller 16135550185 Chef “90 Bedford Street, Apt 20”
4. Chandler BING 1 613 555 0161 ??? “90 Bedford Street, Apt 19”
5. Pheobe Buffay 1(613)5550114 musician “5 Morton Street, Apt. 14”
6. Joseph (Joey) Tribbiani 1(613)555-0148 actor “90 Bedford Street, Apt 19”

clean()

The clean() function is used to clean a single column within a data frame, where the column contains strings. By default, clean() will turn all strings into lowercase and remove characters such as quotation marks and punctuation. This function is especially useful for cleaning columns containing attributes such as names, addresses, abstracts, unstructured text, etc. It is less useful for cleaning columns containing attributes such as phone numbers or lists.

Although not discussed in the documentation, the clean() function also automatically removes leading and trailing white-space and collapses consecutive white-space into a single space.

Example

In this example we will first use default parameters to clean the name column and store it as a new column. Additionally, we will provide a custom replace_by_none argument to clean the occupation column, removing the brackets and their content and placing the result in a new column. The code block and resulting data frame are shown below.

from recordlinkage.standardise import clean
import pandas as pd

df = pd.read_csv("friends.csv")

# Default Cleaning
df["name_clean_default"] = clean(df["name"])

# Clean the `occupation` column, but keep brackets and their contents.
df["occupation_clean"]= clean(df["occupation"],
                              replace_by_none='[^ \\-\\_A-Za-z0-9]+',
                              remove_brackets=False)

name phone_number occupation address name_clean_default occupation_clean
1. Rachel Green 1(613)555 0149 buyer (fashion) “90 Bedford Street, Apt 20” 1 rachel green buyer (fashion)
2. Ross Geller +1-613-555-0138 paleontogist “100 Grove Street, Apartment 16” 2 ross geller paleontogist
3. Mönica Geller 16135550185 Chef “90 Bedford Street, Apt 20” 3 mnica geller chef
4. Chandler BING 1 613 555 0161 ??? “90 Bedford Street, Apt 19” 4 chandler bing  
5. Pheobe Buffay 1(613)5550114 musician “5 Morton Street, Apt. 14” 5 pheobe buffay musician
6. Joseph (Joey) Tribbiani 1(613)555-0148 actor “90 Bedford Street, Apt 19” 6 joseph tribbiani actor

Test your knowledge

Try writing code to perform the following tasks:

  • Clean and strip accents from the name column. Replace the uncleaned column with the result.
  • Clean the name column, removing all instances of numbers from the field. Replace the uncleaned column with the result.

phonenumbers()

The phonenumbers() function is used to remove all non-number characters from a cell, excluding the ‘+’ character. To also remove ‘+’ you can use a custom version of the clean() function.

Example

The example below cleans the phone_number column and replaces the old column with the result.

# Clean the phone_number column and replaces the old column with the result.
from recordlinkage.standardise import phonenumbers
import pandas as pd

df = pd.read_csv("friends.csv")

df["phone_number"]= phonenumbers(df["phone_number"])

name phone_number occupation address
1. Rachel Green 16135550149 buyer (fashion) “90 Bedford Street, Apt 20”
2. Ross Geller +16135550138 paleontogist “100 Grove Street, Apartment 16”
3. Mönica Geller 16135550185 Chef “90 Bedford Street, Apt 20”
4. Chandler BING 16135550161 ??? “90 Bedford Street, Apt 19”
5. Pheobe Buffay 16135550114 musician “5 Morton Street, Apt. 14”
6. Joseph (Joey) Tribbiani 16135550148 actor “90 Bedford Street, Apt 19”

value_occurence()

The value_occurence function is used to count the number of times each item in a column occurs. This is a very general function, that can be used in many different situations. My impression of the function is that it is most useful for solving data specific problems and extracting features from those data. It could also be used to determine whether a column contains unique values and can act as a candidate key. The function could also be used to assign a summary value to each attribute, such as the proportion of the dataset which has the same value as an observation.

Example

In the example below, value_occurence() is used to find the size (in people) of each individual’s household.

from recordlinkage.standardise import value_occurence

from recordlinkage.standardise import value_occurence
import pandas as pd

df = pd.read_csv("friends.csv")

df["household_size"] = value_occurence(df["address"])

name phone_number occupation address household_size
1. Rachel Green 1(613)555 0149 buyer (fashion) “90 Bedford Street, Apt 20” 2
2. Ross Geller +1-613-555-0138 paleontogist “100 Grove Street, Apartment 16” 1
3. Mönica Geller 16135550185 Chef “90 Bedford Street, Apt 20” 2
4. Chandler BING 1 613 555 0161 ??? “90 Bedford Street, Apt 19” 2
5. Pheobe Buffay 1(613)5550114 musician “5 Morton Street, Apt. 14” 1
6. Joseph (Joey) Tribbiani 1(613)555-0148 actor “90 Bedford Street, Apt 19” 2

Test your knowledge

Try writing code to find the number of roommates for each individual. Place the result in a new column.

phonetic()

The phonetic() function is used to convert strings into their corresponding phonetic codes. This is particularly useful when comparing names where different possible spellings make it difficult to find exact matches (Ex. Jillian and Gillian).

Note that some phonetic algorithms have been created specifically for use with English names. Check out the algorithm before choosing whether to use for your own project. All the algorithms have trade-offs, some have increased accuracy for specific languages, some build in understanding of non-English names, some run faster, etc.

Example

The example below will standardize the name column, resulting in a new column called phonetic, where each of the names have been standardized using the nysiis phonetic algorithm.

from recordlinkage.standardise import phonetic
import pandas as pd

# Read in the data
df = pd.read_csv("friends.csv")

# Clean the name column to remove numbers and strip accents
df["name"]= clean(df["name"], replace_by_none='[^ \\-\\_A-Za-z]+', strip_accents="unicode")

# Standardize using the nysiis phonetic algorithm
df["phonetic"] = phonetic(df["name"], method="nysiis")

name phone_number occupation address phonetic
rachel green 1(613)555 0149 buyer (fashion) “90 Bedford Street, Apt 20” RACALGRAN
ross geller +1-613-555-0138 paleontogist “100 Grove Street, Apartment 16” RASGALAR
monica geller 16135550185 Chef “90 Bedford Street, Apt 20” MANACAGALAR
chandler bing 1 613 555 0161 ??? “90 Bedford Street, Apt 19” CANDLARBANG
pheobe buffay 1(613)5550114 musician “5 Morton Street, Apt. 14” FABABAFY
joseph tribbiani 1(613)555-0148 actor “90 Bedford Street, Apt 19” JASAFTRABAN

Test your knowledge

Try writing code to perform the following tasks:

  • Standardize the name column using the metaphone algorithm, having not removed white-space prior to applying the phonetic function. Place the result in a new column called name_metaphone.
  • Compare the results of all four phonetic algorithms. Create a new column for the result of each algorithm.

What's next?

After you have sufficiently pre-processed and standardized your data, you will want to begin the record linking process. The first step is to create a set of candidate links through a process called indexing. For an introduction to indexing, check out the next tutorial in this series indexing candidate links with recordlinkage.