Introduction
Over the summer, NetLab’s development team is working to contribute to a data integration tool to facilitate research which combines disparate datasets. Before beginning our development work, we did some research into what is currently available. We found a few competing tools, though we felt none of them fully met our goals. However, a python tool called recordlinkage shows significant potential. So, taking a page from xkcd (see the comic below), we have decided to build on top of recordlinkage rather than develop our own competing tool.
This tutorial is meant to introduce features currently implemented in recordlinkage as well as some of the added functionality the NetLab developers have begun to implement. This notebook applies recordlinkage to some of the problems faced by our IDI dataset. First, we will use recordlinkage to assign location information to universities within our dataset. Second, we will use a similar process to find and merge duplicate entities in our IDI dataset.
Before you begin
Before running through this notebook you should make sure you have the necessary python packages installed on your machine: pandas and recordlinkage. You can install pandas by running: pip install pandas
in terminal. Similarly, run: pip install recordlinkage
in terminal to install the recordlinkage package.
Define functions
Below are some preliminary functions that the NetLab developers have defined to build on the functionality provided by recordlinkage
. Additional and up-to-date functions can be found in NetLab’s labutils
repository on GitHub.
Task 1 — Assign countries
First, we are going to link the idi_table and university datasets together in an attempt to find country data for institutions in our IDI dataset. To do this, we will compare the "Full Name"
variable from the idi_table dataset with the "name1"
, "name2"
, and "name3"
variables from the university dataset.
To begin, we read in the idi_table data and use recordlinkage’s standardise
module to quickly clean the dataset. Then, for the sake of time, we will filter the data to only include those entities that begin with the letter ‘z’. The first five rows from this dataset are shown below.
index | Full Name |
---|---|
494 | zurich university hospital |
559 | zagazig university |
648 | zentrum innere medizin und deutsches zentrum |
1237 | zhejiang academy of agriculture sciences |
1253 | zambia emory hiv research program |
Next, we can read in the university data and once again use the standardise
module to clean the variables of interest. We will also filter the dataset to only include universities that begin with the letter ‘z’. The first five rows from this dataset are shown below.
index | name1 | name2 | name3 | addr1 | addr2 | addr3 | addr4 | pop |
---|---|---|---|---|---|---|---|---|
117 | zsigmond kirly foiskola | king sigismund college | none | Kelta u.2. | Budapest | 1139 Budapest | Hungary | “1,000,000-5,000,000” |
178 | ziauddin university | none | none | “4/B, Shahra-e-Ghalib, Block-6” | Karachi | 75600 Sindh | Pakistan | “over 5,000,000” |
204 | zhaotong university | none | “Yunnan Road, Zhaoyang District” | Zhaotong | 657000 Yunnan | China | “over 5,000,000” | |
393 | zyegin niversitesi | zyegin university | none | Kusbakisi Cad. No 2 | Istanbul | 34662 Istanbul | Turkey | “over 5,000,000” |
409 | zhejiang normal university | none | None | Jinhua | 524088 Zhejiang | China | “1,000,000-5,000,000” |
Once both datasets have been read in, we will use recordlinkage to index the data. This essentially creates candidate pairs that we can later evaluate. In this case, because each of our datasets is relatively small we will use a full index. This will result an index with all possible pairs of records.
Once we have a list of pairs we want to compare, we can go ahead and perform our comparisons. The code block below shows how built-in recordlinkage functionality can be used to do this. Upcoming tutorials will explore how to create your own custom comparison methods, but for now we will just stick with the built-in ones.
Then, once the comparisons have been made we will classify our comparisons to get a list of matches. The code below uses a really simplistic method of filter, where if any of the comparisons resulted in a score higher than 0.75 we will classify it as a match. Other more complicated classification methods exist and some machine learning approaches are also built-in to recordlinkage.
Now that we have a list of matches we can fuse our dataset. The code block below uses one of the functions defined at the beginning of the document to fuse the datasets together. The fusion below results in a table where each IDI record has been kept. Then, the information from the highest scoring match for each record is added to the table. Those records who did not have a match are still included, but their additional columns are left blank. Five rows of the fused dataset are shown below.
index | Full Name | name1 | addr1 | addr2 | addr3 | addr4 |
---|---|---|---|---|---|---|
2617 | zambia emory hiv research project | |||||
3241 | zhejiang university | zhejiang university | “38 Zheda Road, Xihu” | Hangzhou | 310027 Zhejiang | China |
4468 | zhongkai university of agriculuture | |||||
6863 | zahedan university of medical sciences | zahedan university of medical sciences | None | Zahedan | 98135 Sistan and Baluchestan | Iran |
559 | zagazig university | zagazig university | None | Zagazig | 44519 Sharqia | Egypt |
Task 2 — Find duplicate entities
Second, we are going to work on finding duplicate entities within a subset of the IDI dataset. We will use the dataset’s “Full Name” column, which contains named entities which were identified by Stanford’s NLP toolkit.
To begin, we will read in the IDI data from its csv and use recordlinkage’s standardise
module to clean the dataset. Then, in the interest of time we will filter the data to only include entities which begin with the letter ‘a’. The first five rows of this dataset are shown below.
index | Full Name |
---|---|
52 | ahvaz jundishapur university of medical sciences |
62 | animal et gestion integree des risques |
86 | anhui province hospital affiliated anhui medical university |
100 | aberdeen university |
109 | arabian gulf university |
Since we are doing deduplication, we will only need to read in a single dataset. Despite this, recordlinkage operates in essentially the same way as we saw previously. The following code block indexes the data, but does so using the sorted neighbourhood method. To read about the differences between a full index and sorted neighbourhood index, read recordlinkage’s documentation.
Once again, following indexing, we will compare our pairs and then classify the results. The process here follows the same steps we used in Task 1 above. The only major differences are that we will use a threshold of 0.9 for match classification and we will make use of the show_duplicates()
function to display the pairs classified as matches. This table is shown below.
index | A_Full Name | B_Full Name |
---|---|---|
0 | ahvaz jundishapur university of medical sciences | ahvaz jondishapur university of medical sciences |
1 | anhui academy of agricultural sciences | anhui academy of agriculture sciences |
2 | albert ludwigs university of freiburg | albert ludwigs university freiburg |
3 | advanced science and technology | advanced science and biotechnology |
4 | 309th hospital of chinese people s liberation army | 307th hospital of chinese people s liberation army |
5 | applied life sciences | applied life sciences |
6 | awadesh pratap singh university | awadhesh pratap singh university |
7 | affiliated zhongshan hospital of fudan university | affiliated zhongshan hospital of dalian university |
8 | alpert school of medicine of brown university | alpert school of medicine brown university |
9 | australia school of biological sciences | australia school of biomedical sciences |
10 | aix marseille universite um | aix marseille universite um2 |
11 | academy of scientific and innovative research | academy of scientific innovative research |
12 | agriculture organization of the united nations | agriculture organization of the united nations fao |
13 | agricultural research service | agricultural research service |
14 | aids programme of research south africa | aids program of research south africa |
Next, we will actually go about deduplicating the dataset. This uses the deduplicate()
function included at the beginning of this tutorial. Note this function is very much a work in progress and won’t work seamlessly with other datasets. The code to perform this is included below.
Summary
That concludes the overview of recordlinkage and how it might apply to data of interest to academic labs such as NetLab. Look out for new tutorials introducing the different aspects of recordlinkage, from cleaning, to classifying, to fusing.