Introduction
Over the summer, NetLab’s development team is contributing to a data integration tool to facilitate research that combines disparate datasets. Before beginning our development work, we did some research into what is currently available. We found a few competing tools, though we felt none of them fully met our goals. However, a Python package called recordlinkage shows significant potential. So, taking a page from xkcd (see the comic below), we have decided to build on top of recordlinkage rather than develop yet another competing tool.
![xkcd comic “Standards”, on the proliferation of competing standards](/networks-lab/sites/default/files/uploads/images/standards.png)
This tutorial introduces features currently implemented in recordlinkage, as well as some of the added functionality the NetLab developers have begun to implement. The notebook applies recordlinkage to some of the problems posed by our IDI dataset. First, we will use recordlinkage to assign location information to universities within the dataset. Second, we will use a similar process to find and merge duplicate entities in the IDI dataset.
Before you begin
Before running through this notebook, make sure you have the necessary Python packages installed on your machine: pandas and recordlinkage. You can install both from the terminal:
pip install pandas
pip install recordlinkage
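If you want to confirm the installs worked, a quick optional check is to import both packages and print their versions. This assumes both packages expose a __version__ attribute, which releases from this era do:
import pandas
import recordlinkage

print("pandas:", pandas.__version__)
print("recordlinkage:", recordlinkage.__version__)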
Define functions
Below are some preliminary functions that the NetLab developers have defined to build on the functionality provided by recordlinkage. Additional and up-to-date functions can be found in NetLab’s labutils repository on GitHub. Note that the code throughout this tutorial targets recordlinkage 0.9’s interface and the pandas release current at the time of writing.
import pandas as pd

def fuse(dfA, dfB, dfmatches, authoritative="dfA"):
    # Fuse two dataframes into one, preferring values from the
    # authoritative frame wherever the two share a column.
    newDF = pd.DataFrame()
    for (indexA, indexB), _ in dfmatches.iterrows():
        dictA = dfA.loc[indexA].to_dict()
        dictB = dfB.loc[indexB].to_dict()
        newDict = {}
        # Write the non-authoritative values first so the authoritative
        # frame's values overwrite them on shared columns.
        if authoritative == "dfB":
            newDict.update(dictA)
            newDict.update(dictB)
        elif authoritative == "dfA":
            newDict.update(dictB)
            newDict.update(dictA)
        newDF = newDF.append(newDict, ignore_index=True)
    return newDF
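As a quick illustration of how the authoritative argument works, here is a hypothetical example; the two toy frames and the MultiIndexed match table are invented purely for demonstration:
toyA = pd.DataFrame({"name": ["zagazig university"], "country": ["Egypt"]})
toyB = pd.DataFrame({"name": ["zagazig univ"], "city": ["Zagazig"]})
toy_matches = pd.DataFrame(index=pd.MultiIndex.from_tuples([(0, 0)]))

# With authoritative="dfA", the first frame's value wins on the shared
# "name" column, while "country" and "city" are carried over untouched.
fuse(toyA, toyB, toy_matches, authoritative="dfA")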
def left_fuse(dfA, dfB, dfmatches):
    # Keeps all the rows in dfA, matching each to at most one row in dfB.
    newDF = dfA.copy()
    columns = dfB.columns.values
    # Create empty columns to hold the incoming dfB values
    for col in columns:
        newDF[col] = ''
    for (indexA, indexB), _ in dfmatches.iterrows():
        for col in columns:
            newDF.loc[indexA, col] = dfB.loc[indexB, col]
    return newDF
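A similarly hypothetical illustration (toy frames invented for demonstration) shows how left_fuse keeps unmatched rows while attaching columns from matched ones:
toyA = pd.DataFrame({"Full Name": ["zagazig university", "zhejiang university"]})
toyB = pd.DataFrame({"addr4": ["Egypt"]})
toy_matches = pd.DataFrame(index=pd.MultiIndex.from_tuples([(0, 0)]))

# Row 0 gains toyB's addr4 value; unmatched row 1 keeps an empty string.
left_fuse(toyA, toyB, toy_matches)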
def deduplicate(df, dfmatches, columns_append=[]):
    # Merge each matched pair into a single row, combining the values of
    # any columns named in columns_append, then drop the duplicate row.
    newDF = df.copy()
    for (indexA, indexB), _ in dfmatches.iterrows():
        for col in columns_append:
            # These columns hold stringified lists (e.g. "['a', 'b']");
            # strip the punctuation and merge the two value sets.
            old_set = set(newDF.loc[indexA, col]
                          .replace("[", "").replace("]", "")
                          .replace("'", "").split(", "))
            new_set = set(df.loc[indexB, col]
                          .replace("[", "").replace("]", "")
                          .replace("'", "").split(", "))
            newDF.loc[indexA, col] = str(list(old_set.union(new_set)))
        # Dropping the second row of each pair is going to be problematic
        # with triplicates, etc.
        newDF = newDF.drop(indexB)
    return newDF
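To see what the columns_append handling does, here is a hypothetical example; the toy frame and its stringified "Authors" lists are made up for illustration:
toy = pd.DataFrame({"Full Name": ["acme university", "acme univ"],
                    "Authors": ["['smith', 'lee']", "['lee', 'chan']"]})
toy_matches = pd.DataFrame(index=pd.MultiIndex.from_tuples([(0, 1)]))

# The two author lists are merged into one de-duplicated list and the
# second row is dropped.
deduplicate(toy, toy_matches, columns_append=["Authors"])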
def show_duplicates(df, dfmatches):
    # Build a side-by-side table of each matched pair so the candidate
    # duplicates can be inspected by eye.
    newDF = pd.DataFrame()
    for (indexA, indexB), _ in dfmatches.iterrows():
        dictA = df.loc[indexA].to_dict()
        dictB = df.loc[indexB].to_dict()
        newDict = {}
        for item in dictA:
            newDict["A_" + item] = dictA[item]
        for item in dictB:
            newDict["B_" + item] = dictB[item]
        newDF = newDF.append(newDict, ignore_index=True)
    return newDF
Task 1 — Assign countries
First, we are going to link the idi_table and university datasets together in an attempt to find country data for institutions in our IDI dataset. To do this, we will compare the "Full Name" variable from the idi_table dataset with the "name1", "name2", and "name3" variables from the university dataset.
To begin, we read in the idi_table data and use recordlinkage’s standardise module to quickly clean the dataset. Then, for the sake of time, we filter the data to only include entities that begin with the letter ‘z’. The first five rows of this dataset are shown below.
import pandas as pd
import recordlinkage as rl
from recordlinkage.standardise import clean
# ************
# Get IDI Data
# ************
# Load the idi data
idi = pd.read_csv("idi_table.csv", encoding='latin-1')
# Clean the Full Name column (lowercases, strips accents, removes extra
# whitespace; brackets are deliberately kept)
idi["Full Name"] = clean(idi["Full Name"], remove_brackets=False, strip_accents='ascii')
# Filter the data to only include entities that begin with z
# (For the sake of time)
z_idi = idi[idi['Full Name'].str.startswith("z", na=False)]
index | Full Name |
---|---|
494 | zurich university hospital |
559 | zagazig university |
648 | zentrum innere medizin und deutsches zentrum |
1237 | zhejiang academy of agriculture sciences |
1253 | zambia emory hiv research program |
Next, we can read in the university data and once again use the standardise module to clean the variables of interest. We will also filter the dataset to only include universities that begin with the letter ‘z’. The first five rows of this dataset are shown below.
# *******************
# Get University Data
# *******************
# Read in University Data
uni = pd.read_csv("universities.csv")
# clean the data (alternative names are included)
uni["name1"] = clean(uni["name1"])
uni["name2"] = clean(uni["name2"])
uni["name3"] = clean(uni["name3"])
# Filter the data to only include universities that begin with z
# (For the sake of time)
z_uni = uni[uni['name1'].str.startswith("z", na=False)]
index | name1 | name2 | name3 | addr1 | addr2 | addr3 | addr4 | pop |
---|---|---|---|---|---|---|---|---|
117 | zsigmond kirly foiskola | king sigismund college | none | Kelta u.2. | Budapest | 1139 Budapest | Hungary | 1,000,000-5,000,000 |
178 | ziauddin university | none | none | 4/B, Shahra-e-Ghalib, Block-6 | Karachi | 75600 Sindh | Pakistan | over 5,000,000 |
204 | zhaotong university | none | none | Yunnan Road, Zhaoyang District | Zhaotong | 657000 Yunnan | China | over 5,000,000 |
393 | zyegin niversitesi | zyegin university | none | Kusbakisi Cad. No 2 | Istanbul | 34662 Istanbul | Turkey | over 5,000,000 |
409 | zhejiang normal university | none | None | | Jinhua | 524088 Zhejiang | China | 1,000,000-5,000,000 |
Once both datasets have been read in, we will use recordlinkage to index the data. Indexing creates the candidate pairs that we will evaluate later. In this case, because each of our datasets is relatively small, we will use a full index. This results in an index containing all possible pairs of records.
# *****
# Index
# *****
# Find all possible entity/university combinations
# Note: this uses recordlinkage 0.9's updated interface
pairs = rl.FullIndex()
pairs = pairs.index(z_idi, z_uni)
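Optionally, you can sanity-check the result: a full index of these two datasets should contain exactly len(z_idi) * len(z_uni) candidate pairs (this assumes pairs behaves like a pandas MultiIndex, as it does in recordlinkage 0.9):
# Every z_idi row is paired with every z_uni row, so these numbers match.
print(len(pairs), len(z_idi) * len(z_uni))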
Once we have a list of pairs we want to compare, we can go ahead and perform our comparisons. The code block below shows how built-in recordlinkage functionality can be used to do this. Upcoming tutorials will explore how to create your own custom comparison methods, but for now we will stick with the built-in ones.
Then, once the comparisons have been made, we classify them to get a list of matches. The code below uses a very simplistic filtering rule: if any of the comparisons produced a score higher than 0.75, the pair is classified as a match. More sophisticated classification methods exist, and some machine learning approaches are built into recordlinkage; a sketch of one such classifier follows the classification code below.
# *******
# Compare
# *******
# Compare the entity name with the primary university name
compare_c1 = rl.Compare(pairs, z_idi, z_uni)
compare_c1.string("Full Name", "name1", name="Name1")
compare_c1.string("Full Name", "name2", name="Name2")
compare_c1.string("Full Name", "name3", name="Name3")
# ********
# Classify
# ********
# Compile list of matched pairs
matches = compare_c1.vectors[compare_c1.vectors.max(axis=1) > 0.75]
matches = matches.sort_values("Name1")
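For example, instead of a fixed threshold you could hand the comparison vectors to one of recordlinkage’s built-in unsupervised classifiers. The sketch below is illustrative only and is not part of the original workflow; it assumes the 0.9-era learn() method for unsupervised training:
# A sketch of an unsupervised alternative to the 0.75 threshold rule:
# cluster the comparison vectors into matches and non-matches.
kmeans = rl.KMeansClassifier()
km_matches = kmeans.learn(compare_c1.vectors)  # returns a match index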
Now that we have a list of matches, we can fuse our datasets. The code block below uses one of the functions defined at the beginning of this document to fuse the datasets together. The fusion produces a table in which every IDI record is kept, with the information from the highest-scoring match appended to each record. Records that did not have a match are still included, but their additional columns are left blank. Five rows of the fused dataset are shown below.
# ****
# Fuse
# ****
# Use a RAD-developed function to fuse the data (keeps all rows in z_idi)
fused = left_fuse(z_idi, z_uni, matches)
# Select only the columns of interest
fused = fused[["Full Name", "name1", "addr1", "addr2", "addr3", "addr4"]]
index | Full Name | name1 | addr1 | addr2 | addr3 | addr4 |
---|---|---|---|---|---|---|
2617 | zambia emory hiv research project | | | | | |
3241 | zhejiang university | zhejiang university | 38 Zheda Road, Xihu | Hangzhou | 310027 Zhejiang | China |
4468 | zhongkai university of agriculuture | | | | | |
6863 | zahedan university of medical sciences | zahedan university of medical sciences | None | Zahedan | 98135 Sistan and Baluchestan | Iran |
559 | zagazig university | zagazig university | None | Zagazig | 44519 Sharqia | Egypt |
Task 2 — Find duplicate entities
Second, we are going to work on finding duplicate entities within a subset of the IDI dataset. We will use the dataset’s “Full Name” column, which contains named entities identified by Stanford’s NLP toolkit.
To begin, we read in the IDI data from its CSV and use recordlinkage’s standardise module to clean the dataset. Then, in the interest of time, we filter the data to only include entities that begin with the letter ‘a’. The first five rows of this dataset are shown below.
# ************
# Get IDI Data
# ************
# Read in the idi data from the csv
idi = pd.read_csv("idi_table.csv", encoding='latin-1')
# Clean the idi data
idi["Full Name"] = clean(idi["Full Name"], remove_brackets=False, strip_accents='ascii')
# Filter the data to only include entities that begin with a
a_idi = idi[idi['Full Name'].str.startswith("a", na=False)]
index | Full Name |
---|---|
52 | ahvaz jundishapur university of medical sciences |
62 | animal et gestion integree des risques |
86 | anhui province hospital affiliated anhui medical university |
100 | aberdeen university |
109 | arabian gulf university |
Since we are deduplicating, we only need to read in a single dataset. Despite this, recordlinkage operates in essentially the same way as before. The following code block indexes the data, but does so using the sorted neighbourhood method: instead of generating every possible pair, it sorts the records on the chosen key and only pairs records that fall within a sliding window of each other, which scales much better than a full index. For more detail on the differences between the two, see recordlinkage’s documentation.
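As a rough, optional illustration of that difference (using the a_idi frame prepared above), you can count the candidate pairs each method generates; the exact numbers will depend on your data:
# The full index generates all possible pairs of a_idi's rows, while the
# sorted neighbourhood index only pairs names within a window of 5.
full_pairs = rl.FullIndex().index(a_idi)
sni_pairs = rl.SortedNeighbourhoodIndex(on="Full Name", window=5).index(a_idi)
print(len(full_pairs), len(sni_pairs))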
# *****
# Index
# *****
pairs = rl.SortedNeighbourhoodIndex(on="Full Name", window=5)
pairs = pairs.index(a_idi)
Once again, following indexing, we will compare our pairs and then classify the results. The process follows the same steps we used in Task 1 above. The only major differences are that we will use a threshold of 0.9 for match classification and we will use the show_duplicates() function to display the pairs classified as matches. The resulting table is shown below.
# *******
# Compare
# *******
# Compare entity names with one another
compare_c2 = rl.Compare(pairs, a_idi, a_idi) #set up compare object
compare_c2.string("Full Name", "Full Name", name="Full Name") #compare
# *****
# Match
# *****
# Compile list of matched pairs, using threshold of 0.9
# (With a single comparison column, sum(axis=1) is just that column's score)
matches = compare_c2.vectors[compare_c2.vectors.sum(axis=1) > 0.9].sort_index()  # You can play with the threshold (0.9)
# Print out the matches found
show_matches = show_duplicates(a_idi, matches)
show_matches[['A_Full Name', 'B_Full Name']]
index | A_Full Name | B_Full Name |
---|---|---|
0 | ahvaz jundishapur university of medical sciences | ahvaz jondishapur university of medical sciences |
1 | anhui academy of agricultural sciences | anhui academy of agriculture sciences |
2 | albert ludwigs university of freiburg | albert ludwigs university freiburg |
3 | advanced science and technology | advanced science and biotechnology |
4 | 309th hospital of chinese people s liberation army | 307th hospital of chinese people s liberation army |
5 | applied life sciences | applied life sciences |
6 | awadesh pratap singh university | awadhesh pratap singh university |
7 | affiliated zhongshan hospital of fudan university | affiliated zhongshan hospital of dalian university |
8 | alpert school of medicine of brown university | alpert school of medicine brown university |
9 | australia school of biological sciences | australia school of biomedical sciences |
10 | aix marseille universite um | aix marseille universite um2 |
11 | academy of scientific and innovative research | academy of scientific innovative research |
12 | agriculture organization of the united nations | agriculture organization of the united nations fao |
13 | agricultural research service | agricultural research service |
14 | aids programme of research south africa | aids program of research south africa |
Next, we will actually deduplicate the dataset, using the deduplicate() function included at the beginning of this tutorial. Note that this function is very much a work in progress and won’t yet work seamlessly with other datasets. The code to perform the deduplication is included below.
# ***********
# Deduplicate
# ***********
# Use a RAD-developed function to merge duplicates.
# Authors, PMID, City, Country, and Type are aggregated into a single list.
deduplicated = deduplicate(a_idi, matches,
                           columns_append=["Authors", "PMID", "City",
                                           "Country", "Type"])
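As a quick check that the merge behaved as expected, you can compare row counts before and after; each merged pair should remove exactly one row (assuming no triplicate clusters in this subset):
# One row disappears per merged duplicate pair.
print(len(a_idi), len(deduplicated))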
Summary
That concludes the overview of recordlinkage and how it might apply to data of interest to academic labs such as NetLab. Look out for new tutorials introducing the different aspects of recordlinkage, from cleaning, to classifying, to fusing.