Hi! This post is a slightly cleaned-up version of the online supplement for the article I wrote on metaknowledge with Reid McIlroy-Young. You can read “Introducing metaknowledge: Software for Computational Research in Information Science, Science of Science, and Network Analysis”. It includes an overview of our design philosophy as well as detailed examples of how to use metaknowledge to do what you want it to do. Unfortunately, the article is currently behind a paywall. The open access version is coming soon! I’ll update this post with the link when it is available.
If you would prefer to work through this material in a Jupyter Notebook, you should download the original online supplement. It includes all the data from the article plus four Notebooks that do everything we do in the article.
Getting started
Installing Python, the Scientific Stack, and metaknowledge
If you have not already done so, you will need to download and install the Anaconda distribution of Python 3, and the current public releases of metaknowledge and plotly. The other packages loaded below are included in the Anaconda distribution of Python 3.
Load packages
import metaknowledge as mk
import pandas
# for interactive graphs
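# note: in plotly 4 and later this module lives in the separate chart-studio package (import chart_studio.plotly as py)
import plotly.plotly as py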
import plotly.graph_objs as go
# for static graphs
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style(style="white") # change the default background plot colour
plt.rc("savefig", dpi=400) # improve default resolution of graphics
If this is your first time using plot.ly, you will need to sign up for a free account and get an API key. You can learn how to get started. Alternatively, you can use another package such as Bokeh for interactive graphs, or simply skip interactivity and stick to static graphs using a package like Seaborn.
The GitHub repository contains the data used in this post. If you are following along with this tutorial, make sure that your file paths point to the raw_data directory, or whatever you have named the directory that contains your data.
Creating and processing record collections
Parsing raw data to create records and record collections
All we need to do to create a Record Collection is provide the file path to the raw data. We will use the information science and bibliometrics dataset used in the “Introducing metaknowledge” article, which we have stored in the raw_data/imetrics/ directory.
RC = mk.RecordCollection('raw_data/imetrics/', cached = True)
We can easily write the full dataframe to a .csv file using the writeCSV method. This file can be used by any other research software. Let’s save it in a directory called generated_datasets.
Of course, it is also possible to continue working in Python. metaknowledge has some useful functions for working with Record Collections, but researchers can also use other Python packages such as Pandas.
The code block below uses the metaknowledge method yearSplit to extract the records published in 2013 and 2014 and shows the estimates for author gender. The process for estimating author genders uses birth record and name data. It is described in the article.
RC1314 = RC.yearSplit(2013, 2014)
gender_breakdown = RC1314.genderStats()
{'Female': 506, 'Male': 1349, 'Unknown': 1361}
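Raw counts like these are often easier to compare as proportions. The following is a minimal sketch, assuming gender_breakdown is the plain dictionary shown above; the helper name to_shares is ours and is not part of metaknowledge.

```python
# Counts copied from the genderStats() output above.
gender_breakdown = {'Female': 506, 'Male': 1349, 'Unknown': 1361}

def to_shares(counts):
    """Convert a dict of counts into percentages (hypothetical helper)."""
    total = sum(counts.values())
    return {label: round(100 * n / total, 1) for label, n in counts.items()}

shares = to_shares(gender_breakdown)
for label, pct in shares.items():
    print(f"{label}: {pct}%")
```

Seen this way, the Unknown category is the largest of the three, which is worth keeping in mind before drawing conclusions from the Female and Male counts alone.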
The glimpse method in metaknowledge is a convenient way to quickly view the most frequently occurring authors and journals, and the most highly cited articles. It will print a quick summary to screen.
RecordCollection glimpse made at: 2017-08-29 09:36:01
8140 Records from files-from-raw_data/imetrics/
Top Authors
1 Bornmann, Lutz
2 Leydesdorff, Loet
3 Thelwall, Mike
4 Rousseau, Ronald
6 DAngelo, Ciriaco Andrea
6 Abramo, Giovanni
7 Glanzel, Wolfgang
8 Glanzel, W
9 Huang, Mu-Hsuan
10 Lariviere, Vincent
11 Waltman, Ludo
12 Ding, Ying
13 Daniel, Hans-Dieter
14 Cronin, Blaise
15 Rousseau, R
Top Journals
Top Cited
1 Hirsch JE, 2005, P NATL ACAD SCI USA, V102, P16569, DOI 10.1073/pnas.0507655102
2 Egghe L, 2006, SCIENTOMETRICS, V69, P131, DOI 10.1007/s11192-006-0144-7
3 SMALL H, 1973, J AM SOC INFORM SCI, V24, P265, DOI 10.1002/asi.4630240406
4 GARFIELD E, 1972, SCIENCE, V178, P471, DOI 10.1126/science.178.4060.471
5 MERTON RK, 1968, SCIENCE, V159, P56, DOI 10.1126/science.159.3810.56
5 Glanzel W, 2001, SCIENTOMETRICS, V51, P69, DOI 10.1023/A:1010512628145
6 Leydesdorff L, 2011, J AM SOC INF SCI TEC, V62, P846, DOI 10.1002/asi.21509
6 PRICE DJD, 1965, SCIENCE, V149, P510
7 Katz JS, 1997, RES POLICY, V26, P1, DOI 10.1016/S0048-7333(96)00917-1
8 Glanzel W, 2006, SCIENTOMETRICS, V67, P315, DOI 10.1556/Scient.67.2006.2.12
8 Wasserman S, 1994, SOCIAL NETWORK ANAL
9 Leydesdorff L, 2009, J AM SOC INF SCI TEC, V60, P348, DOI 10.1002/asi.20967
9 SCHUBERT A, 1986, SCIENTOMETRICS, V9, P281, DOI 10.1007/BF02017249
10 Moed H. F., 2005, CITATION ANAL RES EV
11 White HD, 1998, J AM SOC INFORM SCI, V49, P327, DOI 10.1002/(SICI)1097-4571(19980401)49:43.0.CO;2-W
11 De Solla Price Derek J., 1963, LITTLE SCI BIG SCI
12 Glanzel W, 2003, SCIENTOMETRICS, V56, P357, DOI 10.1023/A:1022378804087
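Repeated rank numbers in the lists above (for example, two entries at rank 6) are what you would expect if tied items share a rank. The sketch below illustrates that idea with a dense ranking over made-up counts. It is an assumption about how to read the output, not metaknowledge's actual implementation, and dense_rank is a hypothetical helper.

```python
from collections import Counter

def dense_rank(counts):
    """Return (rank, item, count) tuples; tied counts share a rank."""
    ranked = []
    rank = 0
    previous = None
    # Sort by count (descending), then by name so ties come out in a stable order.
    for item, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if n != previous:
            rank += 1
            previous = n
        ranked.append((rank, item, n))
    return ranked

# Toy author lists for four records (illustrative values only).
records = [["A", "B"], ["A", "C"], ["A", "B"], ["C", "D"]]
author_counts = Counter(author for authors in records for author in authors)

ranking = dense_rank(author_counts)
for rank, author, n in ranking:
    print(rank, author, n)
```

Here authors B and C both appear twice, so they share rank 2 and the next author takes rank 3.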
While glimpse is useful for getting a quick sense of the most frequently appearing authors and journals, and the most highly cited documents, most research workflows require direct interaction with the data stored in the Record Collection. The easiest way to do this is to convert the Record Collection into a Pandas dataframe. This provides access to a wide range of methods for selecting, filtering, grouping, summarizing, modeling, and plotting data.
In addition to the Pandas documentation, researchers may want to consult Wes McKinney’s book Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython or an online tutorial (e.g. Julia Evans’ Pandas cookbook).
df = pandas.DataFrame(RC.makeDict())
selectedVars = df[['AF', 'AB', 'PY', 'TI', 'SO', 'num-Authors', 'TC']]
selectedVars[:5] # show the first 5 rows.
 | AF | AB | PY | TI | SO | num-Authors | TC |
0 | [Bordons, M, Zulueta, MA, Romero, F, Barrigon, S] | A Multidisciplinary Research Programme (MRP) i... | 1999 | Measuring interdisciplinary collaboration with... | SCIENTOMETRICS | 4 | 26 |
1 | [Yan, Erjia, Guns, Raf] | This study examines collaboration dynamics wit... | 2014 | Predicting and recommending collaborations: An... | JOURNAL OF INFORMETRICS | 2 | 7 |
2 | [Davenport, Elisabeth] | None | 2009 | Everyday Information Practices: A Social Pheno... | JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATIO... | 1 | 1 |
3 | [Schubert, Andras, Korn, Andras, Telcs, Andras] | Hirsch-type indices are devised for characteri... | 2009 | Hirsch-type indices for characterizing networks | SCIENTOMETRICS | 3 | 19 |
4 | [Liang, LM, Kretschmer, H, Guo, YZ, Beaver, DD] | This paper is a scientometric study of the age... | 2001 | Age structures of scientific collaboration in... | SCIENTOMETRICS | 4 | 20 |
The two letter variable names are the tags used by Web of Science. See the description of the content of each tag online. The code block above shows 7 tags that are typically of interest, but there are many others available.
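Because the two-letter tags are terse, it can help to keep a small lookup table next to your analysis code. The mapping below is a sketch covering only the tags used above. The descriptions paraphrase the Web of Science field tag list, and describe_columns is a hypothetical helper.

```python
# Short descriptions of the Web of Science tags used in this post.
# 'num-Authors' is not a Web of Science tag; it is one of the columns
# in the dataframe built from makeDict() above.
TAG_LABELS = {
    'AF': 'Author full names',
    'AB': 'Abstract',
    'PY': 'Year published',
    'TI': 'Document title',
    'SO': 'Publication (source) name',
    'TC': 'Times cited',
}

def describe_columns(columns):
    """Return a readable label for each column, falling back to the raw name."""
    return [TAG_LABELS.get(col, col) for col in columns]

labels = describe_columns(['AF', 'AB', 'PY', 'TI', 'SO', 'num-Authors', 'TC'])
print(labels)
```

Unknown column names pass through unchanged, so the helper is safe to use on dataframes that mix Web of Science tags with derived columns.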
We can sort these dataframes by any quantitative variable. Below, we extract the 40 most highly cited articles in the dataset. Rather than print all 40, we will use .head() to print the top 5.
top_40 = selectedVars.sort_values(['TC'], ascending = False)[:40]
top_40.head()
 | AF | AB | PY | TI | SO | num-Authors | TC |
5760 | [Egghe, Leo] | The g-index is introduced as an improvement of... | 2006 | Theory and practise of the g-index | SCIENTOMETRICS | 1 | 538 |
7462 | [Ho, YS] | This study presents a literature review concer... | 2004 | Citation review of Lagergren kinetic rate equa... | SCIENTOMETRICS | 1 | 527 |
3788 | [Liben-Nowell, David, Kleinberg, Jon] | Given a snapshot of a social network, can we i... | 2007 | The link-prediction problem for social networks | JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATIO... | 2 | 437 |
1348 | [Spink, A, Wolfram, D, Jansen, MBJ, Saracevic, T] | In studying actual Web searching by the public... | 2001 | Searching the Web: The public and their queries | JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATIO... | 4 | 365 |
3088 | [Meho, Lokman I., Yang, Kiduk] | The Institute for Scientific Information's (IS... | 2007 | Impact of data sources on citation counts and... | JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATIO... | 2 | 315 |
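For readers new to Pandas, the sort-then-slice pattern above amounts to taking the N largest items by a key. Here is a plain-Python sketch of the same logic, using three rows copied from the table above and heapq.nlargest from the standard library.

```python
import heapq

# Three rows taken from the table above, deliberately out of order.
articles = [
    {'TI': 'Searching the Web: The public and their queries', 'TC': 365},
    {'TI': 'Theory and practise of the g-index', 'TC': 538},
    {'TI': 'The link-prediction problem for social networks', 'TC': 437},
]

# Equivalent in spirit to sort_values(['TC'], ascending=False)[:2].
top_2 = heapq.nlargest(2, articles, key=lambda row: row['TC'])

for row in top_2:
    print(row['TC'], row['TI'])
```

With a full dataframe the Pandas version is more convenient, but the underlying operation is the same.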
We can create an interactive barchart with plot.ly very easily from this dataframe. In general, barcharts are probably the least interesting graphs to make interactive, but in this case it is useful because hovering over the bars displays the full title information (click on the graph first).
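trace = go.Bar(
    # assumed arguments: article titles on x (shown on hover), citation counts on y
    x = top_40['TI'],
    y = top_40['TC']
)
data = [trace]
layout = go.Layout(
    title='Times Cited',
)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='times-cited')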
Click on the graph to see it on plot.ly.
If you are producing a static graph, it makes sense to shorten the titles. Otherwise the y-axis label would be as wide as the longest article title. We will create a new column called short_title that will contain the first 20 characters in the title. We can then use this new variable to create a horizontal bar graph with reasonable labels on the y-axis.
top_40['short_title'] = top_40['TI'].str[:20]
top_40[['TI', 'short_title', 'TC']].head()
 | TI | short_title | TC |
5760 | Theory and practise of the g-index | Theory and practise | 538 |
7462 | Citation review of Lagergren kinetic rate equa... | Citation review of L | 527 |
3788 | The link-prediction problem for social networks | The link-prediction | 437 |
1348 | Searching the Web: The public and their queries | Searching the Web: T | 365 |
3088 | Impact of data sources on citation counts and... | Impact of data sourc | 315 |
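Slicing at a fixed width can cut a title in the middle of a word, as several of the short titles above show. If that matters for your figure, one alternative is textwrap.shorten from the standard library, which cuts at a word boundary and appends a placeholder. This is a sketch of that alternative, not what the code in this post uses.

```python
import textwrap

title = 'Theory and practise of the g-index'

# Plain slicing, as in the short_title column: exactly the first 20 characters.
sliced = title[:20]

# textwrap.shorten drops whole words until the text plus placeholder fits the width.
shortened = textwrap.shorten(title, width=20, placeholder='...')

print(repr(sliced))
print(repr(shortened))
```

To apply the same idea to a dataframe column, the function could be passed to the column's apply method in place of the .str[:20] slice.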
with sns.axes_style("white"):
    # draw the horizontal bars, then label the x-axis and remove the plot borders
    horizontal_bar = sns.barplot(data = top_40, x = 'TC', y = 'short_title', color = 'gray')
    horizontal_bar.set(xlabel='Number of Citations', ylabel='')
    sns.despine(left = True, right = True, bottom = True, top = True)

Creating time series datasets
It is also easy to plot time series graphs, for example of article publications over time. We can do this by using the timeSeries method in metaknowledge, and converting the results into a Pandas dataframe that can be easily plotted. We will write the time series dataset to a csv file at the same time.
#[2:] removes incomplete data from 2016
growth = pandas.DataFrame(RC.timeSeries('year', outputFile = 'generated_datasets/growth.csv'))[2:]
 | count | entry | year |
2 | 643 | 2014 | 2014 |
3 | 572 | 2013 | 2013 |
4 | 545 | 2012 | 2012 |
5 | 494 | 2011 | 2011 |
6 | 535 | 2010 | 2010 |
7 | 467 | 2009 | 2009 |
8 | 389 | 2008 | 2008 |
9 | 395 | 2007 | 2007 |
10 | 353 | 2006 | 2006 |
11 | 267 | 2005 | 2005 |
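Conceptually, a yearly time series like the one above is a count of records per publication year. The sketch below shows that idea with collections.Counter on a handful of made-up years. It illustrates the shape of the result only and is not how metaknowledge implements timeSeries.

```python
from collections import Counter

# Publication years for a few hypothetical records (toy data).
publication_years = [2012, 2013, 2013, 2014, 2014, 2014]

counts = Counter(publication_years)

# One row per year, newest first, mirroring the count and year columns above.
rows = [{'year': year, 'count': n} for year, n in sorted(counts.items(), reverse=True)]

for row in rows:
    print(row['year'], row['count'])
```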
We can easily create an interactive line graph of this dataframe.
trace = go.Scatter(
    x = growth['year'],
    y = growth['count'],
    mode = 'lines+markers',
    name = 'lines+markers'
)
data = [trace]
layout = go.Layout(
    title='Number of Publications',
)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='growth-over-time')
Click on the graph to see it on plot.ly.
Of course, it would be a bit more interesting to compare the number of publications over time for each of the journals in the dataset, so let’s do that.
growth_by_journal = pandas.DataFrame(RC.timeSeries('journal', outputFile = 'generated_datasets/growth_journals.csv'))
scientometrics = growth_by_journal[(growth_by_journal.entry == 'SCIENTOMETRICS') & (growth_by_journal.year
informetrics = growth_by_journal[(growth_by_journal.entry == 'JOURNAL OF INFORMETRICS') & (growth_by_journal.year
jaist = growth_by_journal[(growth_by_journal.entry == 'JOURNAL OF THE ASSOCIATION FOR INFORMATION SCIENCE AND TECHNOLOGY') & (growth_by_journal.year
jasist = growth_by_journal[(growth_by_journal.entry == 'JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE AND TECHNOLOGY') & (growth_by_journal.year
# one line per journal; the journal that changed its name gets two traces
trace1 = go.Scatter(
    x = scientometrics['year'],
    y = scientometrics['count'],
    mode = 'lines+markers',
    name = 'Scientometrics'
)
trace2 = go.Scatter(
    x = informetrics['year'],
    y = informetrics['count'],
    mode = 'lines+markers',
    name = 'Journal of Informetrics'
)
trace3 = go.Scatter(
    x = jaist['year'],
    y = jaist['count'],
    mode = 'lines+markers',
    name = 'JAIST (New Name)'
)
trace4 = go.Scatter(
    x = jasist['year'],
    y = jasist['count'],
    mode = 'lines+markers',
    name = 'JASIST (Old Name)'
)
data = [trace1, trace2, trace3, trace4]
layout = go.Layout(
    title='Number of Publications',
)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='growth-over-time-multiples')
Click on the graph and hover over it to see the number of publications for each journal in each year. Note that although the first issue of the Journal of the American Society for Information Science and Technology came out in 1950, the Web of Science does not have metadata on the journal before 2001.
Click on the graph to see it on plot.ly.
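Because the journal time series was also written to a CSV file, the same kind of per-journal filtering can be done without Pandas. The sketch below reads CSV text with the standard library and keeps the rows for one journal before a cutoff year. It assumes the file has the same columns as the dataframe shown earlier (count, entry, year). The rows are toy values, and CUTOFF_YEAR is an assumed threshold rather than a value taken from the code in this post.

```python
import csv
import io

# Toy stand-in for generated_datasets/growth_journals.csv (values are made up).
csv_text = """count,entry,year
10,SCIENTOMETRICS,2013
12,SCIENTOMETRICS,2014
3,SCIENTOMETRICS,2016
7,JOURNAL OF INFORMETRICS,2014
"""

CUTOFF_YEAR = 2015  # assumption: keep only years before this one

def journal_series(text, journal, cutoff):
    """Return sorted (year, count) pairs for one journal, restricted to year < cutoff."""
    reader = csv.DictReader(io.StringIO(text))
    return sorted(
        (int(row['year']), int(row['count']))
        for row in reader
        if row['entry'] == journal and int(row['year']) < cutoff
    )

series = journal_series(csv_text, 'SCIENTOMETRICS', CUTOFF_YEAR)
print(series)
```

To run this against a real file, replace io.StringIO(text) with an open file handle and adjust the cutoff to match the last complete year in your data.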
Next time, network analysis with metaknowledge!