Getting started with metaknowledge

Tuesday, August 29, 2017
by John McLevey

Hi! This post is a slightly cleaned-up version of the online supplement for the article I wrote on metaknowledge with Reid McIlroy-Young, “Introducing metaknowledge: Software for Computational Research in Information Science, Science of Science, and Network Analysis”. It includes an overview of our design philosophy as well as detailed examples of how to use metaknowledge. Unfortunately, the article is currently behind a paywall. An open access version is coming soon, and I’ll update this post with the link when it is available.

If you would prefer to work through this material in a Jupyter Notebook, you should download the original online supplement. It includes all the data from the article plus four Notebooks that do everything we do in the article.

If you want to learn more, here are some other posts in the metaknowledge tutorial series:

  • Network analysis with metaknowledge
  • Text analysis with metaknowledge
  • Historical bibliometrics with metaknowledge (with Reference Publication Year Spectroscopy)

Likely more to come later.

Getting started

Installing Python, the Scientific Stack, and metaknowledge

If you have not already done so, you will need to download and install the Anaconda distribution of Python 3, and the current public releases of metaknowledge and plotly. The other packages loaded below are included in the Anaconda distribution of Python 3.
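Before going further, it can help to confirm that the packages are importable. Here is a minimal sketch that uses only the standard library. The helper name check_packages is made up for this post, and the list of names simply mirrors the imports used in this tutorial.

```python
from importlib.util import find_spec

def check_packages(names):
    """Return a dict mapping each top-level package name to True if it can be imported."""
    return {name: find_spec(name) is not None for name in names}

# Package names taken from the imports used in this tutorial.
required = ["metaknowledge", "pandas", "plotly", "matplotlib", "seaborn"]
status = check_packages(required)

for name, available in status.items():
    print(f"{name}: {'ok' if available else 'missing'}")
```

Anything reported as missing can usually be installed with pip or conda before continuing.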

Load packages

import metaknowledge as mk
import pandas

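# for interactive graphs
# (in plotly 4 and later, plotly.plotly moved to the separate chart_studio package: import chart_studio.plotly as py)
import plotly.plotly as py
import plotly.graph_objs as go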

# for static graphs
import matplotlib.pyplot as plt
import seaborn as sns

# set the white background and the font scale in one call;
# calling sns.set() after sns.set_style() would reset the style to the default
sns.set(style="white", font_scale=.75)
plt.rc("savefig", dpi=400) # improve default resolution of graphics

If this is your first time using plot.ly, you will need to sign up for a free account and get an API key; the plot.ly documentation explains how to get started. Alternatively, you can use another package such as Bokeh for interactive graphs, or simply skip interactivity and stick to static graphs using a package like Seaborn.

The GitHub repository contains the data used in this post. If you are following along with this tutorial, make sure that your filepaths point to the raw_data directory, or whatever you have named the directory that contains your data.
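A wrong file path is the most common stumbling block at this stage, so it can be worth checking the directory before parsing anything. The sketch below is only an illustration: list_data_files is a hypothetical helper, not part of metaknowledge, and it simply lists the regular files in a directory or fails with a readable message.

```python
from pathlib import Path

def list_data_files(data_dir):
    """Return the sorted regular files in data_dir, or raise a clear error if it is missing."""
    path = Path(data_dir)
    if not path.is_dir():
        raise FileNotFoundError(f"Data directory not found: {path.resolve()}")
    return sorted(p for p in path.iterdir() if p.is_file())

# 'raw_data/imetrics/' is the directory used in this post; adjust it to your own layout.
try:
    files = list_data_files("raw_data/imetrics/")
    print(f"Found {len(files)} files")
except FileNotFoundError as err:
    print(err)
```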

Creating and processing record collections

Parsing raw data to create records and record collections

All we need to do to create a Record Collection is provide the file path to the raw data. We will use the information science and bibliometrics dataset used in the “Introducing metaknowledge” article, which we have stored in a directory called raw_data/imetrics/.

RC = mk.RecordCollection('raw_data/imetrics/', cached = True)
len(RC)

8140

We can easily write the full Record Collection to a .csv file using the writeCSV method. Any software that reads CSV files can then use it. Let’s save it in a directory called generated_datasets.

RC.writeCSV('generated_datasets/records.csv')
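Writing to generated_datasets/ assumes that the directory already exists; if it does not, the write will most likely fail. Here is a small sketch that creates the output directories used in this post when they are missing. The helper name ensure_dirs is hypothetical.

```python
import tempfile
from pathlib import Path

def ensure_dirs(names, base="."):
    """Create each named directory under base if needed and return the resulting paths."""
    created = []
    for name in names:
        path = Path(base) / name
        path.mkdir(parents=True, exist_ok=True)  # no error if it already exists
        created.append(path)
    return created

# Demonstrated in a temporary directory so that running the example leaves nothing behind.
# In a project you would call ensure_dirs(["generated_datasets", "figures"]) instead.
with tempfile.TemporaryDirectory() as tmp:
    paths = ensure_dirs(["generated_datasets", "figures"], base=tmp)
    all_exist = all(p.is_dir() for p in paths)
    print(all_exist)
```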

Of course, it is also possible to continue working in Python. metaknowledge has some useful functions for working with Record Collections, but researchers can also use other Python packages such as Pandas.

The code block below uses the metaknowledge method yearSplit to extract the records published in 2013 and 2014 and shows the estimates for author gender. The process for estimating author genders uses birth record and name data. It is described in the article.

RC1314 = RC.yearSplit(2013, 2014)
gender_breakdown = RC1314.genderStats()
gender_breakdown

{'Female': 506, 'Male': 1349, 'Unknown': 1361}
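Raw counts are easier to interpret as shares. The sketch below converts the dictionary above into proportions, first over all counted authors and then over only those with a gender estimate. The helper name shares is made up for this example; the counts are copied from the output above.

```python
# Counts copied from the genderStats() output above.
gender_breakdown = {'Female': 506, 'Male': 1349, 'Unknown': 1361}

def shares(counts):
    """Convert a dict of counts into a dict of proportions that sum to 1."""
    total = sum(counts.values())
    return {key: value / total for key, value in counts.items()}

overall = shares(gender_breakdown)

# Repeat the calculation for the authors that received a gender estimate.
known = {k: v for k, v in gender_breakdown.items() if k != 'Unknown'}
among_known = shares(known)

for label, share in overall.items():
    print(f"{label}: {share:.1%} of all counted authors")
for label, share in among_known.items():
    print(f"{label}: {share:.1%} of authors with an estimate")
```

Roughly 42% of the authors in this slice have no estimate, so any claim about the gender balance should be read with that in mind.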

The glimpse method in metaknowledge is a convenient way to quickly view the most frequently occurring authors and journals, and the most highly cited articles. It will print a quick summary to screen.

print(RC.glimpse())

RecordCollection glimpse made at: 2017-08-29 09:36:01
8140 Records from files-from-raw_data/imetrics/

Top Authors
1 Bornmann, Lutz
2 Leydesdorff, Loet
3 Thelwall, Mike
4 Rousseau, Ronald
5 SCHUBERT, A
6 DAngelo, Ciriaco Andrea
6 Abramo, Giovanni
7 Glanzel, Wolfgang
8 Glanzel, W
9 Huang, Mu-Hsuan
10 BRAUN, T
10 Lariviere, Vincent
11 Waltman, Ludo
12 Ding, Ying
13 Daniel, Hans-Dieter
14 Cronin, Blaise
15 Rousseau, R

Top Journals
1 SCIENTOMETRICS
2 JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE AND TECHNOLOGY
3 JOURNAL OF INFORMETRICS
4 JOURNAL OF THE ASSOCIATION FOR INFORMATION SCIENCE AND TECHNOLOGY

Top Cited
1 Hirsch JE, 2005, P NATL ACAD SCI USA, V102, P16569, DOI 10.1073/pnas.0507655102
2 Egghe L, 2006, SCIENTOMETRICS, V69, P131, DOI 10.1007/s11192-006-0144-7
3 SMALL H, 1973, J AM SOC INFORM SCI, V24, P265, DOI 10.1002/asi.4630240406
4 GARFIELD E, 1972, SCIENCE, V178, P471, DOI 10.1126/science.178.4060.471
5 MERTON RK, 1968, SCIENCE, V159, P56, DOI 10.1126/science.159.3810.56
5 Glanzel W, 2001, SCIENTOMETRICS, V51, P69, DOI 10.1023/A:1010512628145
6 Leydesdorff L, 2011, J AM SOC INF SCI TEC, V62, P846, DOI 10.1002/asi.21509
6 PRICE DJD, 1965, SCIENCE, V149, P510
7 Katz JS, 1997, RES POLICY, V26, P1, DOI 10.1016/S0048-7333(96)00917-1
8 Glanzel W, 2006, SCIENTOMETRICS, V67, P315, DOI 10.1556/Scient.67.2006.2.12
8 Wasserman S, 1994, SOCIAL NETWORK ANAL
9 Leydesdorff L, 2009, J AM SOC INF SCI TEC, V60, P348, DOI 10.1002/asi.20967
9 SCHUBERT A, 1986, SCIENTOMETRICS, V9, P281, DOI 10.1007/BF02017249
10 Moed H. F., 2005, CITATION ANAL RES EV
11 White HD, 1998, J AM SOC INFORM SCI, V49, P327, DOI 10.1002/(SICI)1097-4571(19980401)49:43.0.CO;2-W
11 De Solla Price Derek J., 1963, LITTLE SCI BIG SCI
12 Glanzel W, 2003, SCIENTOMETRICS, V56, P357, DOI 10.1023/A:1022378804087

While glimpse is useful for getting a quick sense of the most frequently appearing authors and journals, and the most highly cited documents, most research workflows require direct interaction with the data stored in the Record Collection. The easiest way to do this is to convert the Record Collection into a Pandas dataframe. This provides access to a wide range of methods for selecting, filtering, grouping, summarizing, modeling, and plotting data.

In addition to the Pandas documentation, researchers may want to consult Wes McKinney’s book Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython or an online tutorial (e.g. Julia Evans’ Pandas cookbook).

df = pandas.DataFrame(RC.makeDict())
selectedVars = df[['AF', 'AB', 'PY', 'TI', 'SO', 'num-Authors', 'TC']]
selectedVars[:5] # show the first 5 rows

  AF AB PY TI SO num-Authors TC
0 [Bordons, M, Zulueta, MA, Romero, F, Barrigon, S] A Multidisciplinary Research Programme (MRP) i... 1999 Measuring interdisciplinary collaboration with... SCIENTOMETRICS 4 26
1 [Yan, Erjia, Guns, Raf] This study examines collaboration dynamics wit... 2014 Predicting and recommending collaborations: An... JOURNAL OF INFORMETRICS 2 7
2 [Davenport, Elisabeth] None 2009 Everyday Information Practices: A Social Pheno... JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATIO... 1 1
3 [Schubert, Andras, Korn, Andras, Telcs, Andras] Hirsch-type indices are devised for characteri... 2009 Hirsch-type indices for characterizing networks SCIENTOMETRICS 3 19
4 [Liang, LM, Kretschmer, H, Guo, YZ, Beaver, DD] This paper is a scientometric study of the age... 2001 Age structures of scientific collaboration in... SCIENTOMETRICS 4 20

The two-letter variable names are the field tags used by Web of Science; a description of the content of each tag is available online. The code block above selects 7 fields that are typically of interest, but there are many others available.
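For quick reference, the selected columns can be mapped to plain-language descriptions. The descriptions below follow the Web of Science field tag list as I understand it; the dictionary and the describe helper are illustrative only and are not part of metaknowledge.

```python
# Descriptions follow the Web of Science field tag list; 'num-Authors' is not a
# Web of Science tag but a count that metaknowledge adds to each record.
TAG_DESCRIPTIONS = {
    'AF': 'Author full names',
    'AB': 'Abstract',
    'PY': 'Year published',
    'TI': 'Document title',
    'SO': 'Publication (journal) name',
    'TC': 'Times cited count',
    'num-Authors': 'Number of authors (added by metaknowledge)',
}

def describe(columns):
    """Pair each column name with its description, or 'unknown tag' if it is not listed."""
    return [(col, TAG_DESCRIPTIONS.get(col, 'unknown tag')) for col in columns]

for tag, meaning in describe(['AF', 'AB', 'PY', 'TI', 'SO', 'num-Authors', 'TC']):
    print(f"{tag}: {meaning}")
```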

We can sort these dataframes by any quantitative variable. Below, we extract the 40 most highly cited articles in the dataset. Rather than print all 40, we will use .head() to print the top 5.

top_40 = selectedVars.sort_values(['TC'], ascending = False)[:40]
top_40.head()

  AF AB PY TI SO num-Authors TC
5760 [Egghe, Leo] The g-index is introduced as an improvement of... 2006 Theory and practise of the g-index SCIENTOMETRICS 1 538
7462 [Ho, YS] This study presents a literature review concer... 2004 Citation review of Lagergren kinetic rate equa... SCIENTOMETRICS 1 527
3788 [Liben-Nowell, David, Kleinberg, Jon] Given a snapshot of a social network, can we i... 2007 The link-prediction problem for social networks JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATIO... 2 437
1348 [Spink, A, Wolfram, D, Jansen, MBJ, Saracevic, T] In studying actual Web searching by the public... 2001 Searching the Web: The public and their queries JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATIO... 4 365
3088 [Meho, Lokman I., Yang, Kiduk] The Institute for Scientific Information's (IS... 2007 Impact of data sources on citation counts and... JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATIO... 2 315
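Sorting and then slicing works well, but Pandas also has a shortcut for the same task. Here is a minimal sketch on toy data (the titles and counts are made up), assuming the citation column is numeric:

```python
import pandas as pd

# Toy data standing in for the real records; titles and counts are made up.
toy = pd.DataFrame({
    'TI': ['Paper A', 'Paper B', 'Paper C', 'Paper D'],
    'TC': [12, 538, 7, 365],
})

# nlargest(n, column) returns the n rows with the largest values, sorted descending.
top_2 = toy.nlargest(2, 'TC')

# The same rows obtained by sorting and slicing, as in the code above.
top_2_sorted = toy.sort_values(['TC'], ascending=False)[:2]

print(top_2)
```

Both approaches keep the original row index, which is why the index values in the output above are not 0 to 4.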

We can create an interactive barchart with plot.ly very easily from this dataframe. In general, barcharts are probably the least interesting graphs to make interactive, but in this case it is useful because hovering over the bars displays the full title information (click on the graph first).

trace = go.Bar(
            x=top_40['TI'],
            y=top_40['TC']
    )

data = [trace]

layout = go.Layout(
    yaxis=dict(
        title='Times Cited',
    )
)

fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='times-cited')

Click on the graph to see it on plot.ly.

If you are producing a static graph, it makes sense to shorten the titles. Otherwise the y-axis label would be as wide as the longest article title. We will create a new column called short_title that will contain the first 20 characters in the title. We can then use this new variable to create a horizontal bar graph with reasonable labels on the y-axis.

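# assign returns a new dataframe, which avoids a SettingWithCopyWarning on the sliced data
top_40 = top_40.assign(short_title=top_40['TI'].str[:20])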
top_40[['TI', 'short_title', 'TC']].head()

      TI                                                  short_title           TC
5760  Theory and practise of the g-index                  Theory and practise   538
7462  Citation review of Lagergren kinetic rate equa...   Citation review of L  527
3788  The link-prediction problem for social networks     The link-prediction   437
1348  Searching the Web: The public and their queries     Searching the Web: T  365
3088  Impact of data sources on citation counts and...    Impact of data sourc  315

with sns.axes_style("white"):
    horizontal_bar = sns.barplot(data = top_40, x = 'TC', y = 'short_title', color = 'gray')
    horizontal_bar.set(xlabel='Number of Citations', ylabel='')
    sns.despine(left = True, right = True, bottom = True, top = True)
    plt.tight_layout()
plt.savefig('figures/horizontal_barplot.png')

Bar chart showing number of citations for each short title.
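A fixed 20-character cut gives no hint that a label was shortened. One possible alternative, sketched below, marks truncated titles with an ellipsis while keeping every label within the same width. The function name shorten_title is hypothetical; the first two titles are taken from the table above.

```python
def shorten_title(title, max_len=20):
    """Truncate title to at most max_len characters, marking any cut with an ellipsis."""
    if len(title) <= max_len:
        return title
    # Reserve one character for the ellipsis and drop any trailing space before it.
    return title[:max_len - 1].rstrip() + '…'

titles = [
    'Theory and practise of the g-index',
    'The link-prediction problem for social networks',
    'Short title',
]
short = [shorten_title(t) for t in titles]
print(short)
```

With a dataframe, the same function could be applied to the title column with top_40['TI'].apply(shorten_title).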

Creating time series datasets

It is also easy to plot time series graphs, for example of article publications over time. We can do this by using the timeSeries method in metaknowledge, and converting the results into a Pandas dataframe that can be easily plotted. We will write the time series dataset to a csv file at the same time.

#[2:] drops the first two rows, which include the incomplete data from 2016
growth = pandas.DataFrame(RC.timeSeries('year', outputFile = 'generated_datasets/growth.csv'))[2:]
growth[:10]

    count  entry  year
2     643   2014  2014
3     572   2013  2013
4     545   2012  2012
5     494   2011  2011
6     535   2010  2010
7     467   2009  2009
8     389   2008  2008
9     395   2007  2007
10    353   2006  2006
11    267   2005  2005
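Conceptually, a yearly time series like this is just a count of records per publication year. The sketch below reproduces that idea with the standard library on made-up years. It is not how metaknowledge implements timeSeries, only an illustration of the shape of the result.

```python
from collections import Counter

# Hypothetical publication years, one per record; real values would come from the 'PY' field.
years = [2012, 2013, 2013, 2014, 2014, 2014]

counts = Counter(years)

# One row per year, newest first, similar in shape to the table above.
rows = [{'year': year, 'count': n} for year, n in sorted(counts.items(), reverse=True)]

for row in rows:
    print(row)
```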

We can easily create an interactive line graph of this dataframe.

trace = go.Scatter(
    x = growth['year'],
    y = growth['count'],
    mode = 'lines+markers',
    name = 'lines+markers'
)

data = [trace]

layout = go.Layout(
    yaxis=dict(
        title='Number of Publications',
    ),
    xaxis=dict(
        title='Year',
    )
)

fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='growth-over-time')

Click on the graph to see it on plot.ly.

Of course, it would be a bit more interesting to compare the number of publications over time for each of the journals in the dataset, so let’s do that.

growth_by_journal = pandas.DataFrame(RC.timeSeries('journal', outputFile = 'generated_datasets/growth_journals.csv'))


scientometrics = growth_by_journal[(growth_by_journal.entry == 'SCIENTOMETRICS') & (growth_by_journal.year
informetrics = growth_by_journal[(growth_by_journal.entry == 'JOURNAL OF INFORMETRICS') & (growth_by_journal.year
jaist = growth_by_journal[(growth_by_journal.entry == 'JOURNAL OF THE ASSOCIATION FOR INFORMATION SCIENCE AND TECHNOLOGY') & (growth_by_journal.year
jasist = growth_by_journal[(growth_by_journal.entry == 'JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE AND TECHNOLOGY') & (growth_by_journal.year

trace1 = go.Scatter(
    x = scientometrics['year'],
    y = scientometrics['count'],
    mode = 'lines+markers',
    name = 'Scientometrics'
)
trace2 = go.Scatter(
    x = informetrics['year'],
    y = informetrics['count'],
    mode = 'lines+markers',
    name = 'Journal of Informetrics'
)
trace3 = go.Scatter(
    x = jaist['year'],
    y = jaist['count'],
    mode = 'lines+markers',
    name = 'JAIST (New Name)'
)
trace4 = go.Scatter(
    x = jasist['year'],
    y = jasist['count'],
    mode = 'lines+markers',
    name = 'JASIST (Old Name)'
)

data = [trace1, trace2, trace3, trace4]

layout = go.Layout(
    yaxis=dict(
        title='Number of Publications',
    ),
    xaxis=dict(
        title='Year',
    )
)

fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='growth-over-time-multiples')
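The per-journal dataframes used for the traces come from combining two boolean conditions with &: one on the journal name and one on the year. Here is a minimal sketch of that kind of filter on toy data. The counts are made up, journal_series is a hypothetical helper, and cutoff_year is an assumed value chosen only for this example, not the value used for the article.

```python
import pandas as pd

# Toy stand-in for growth_by_journal; the counts are made up.
toy = pd.DataFrame({
    'entry': ['SCIENTOMETRICS', 'SCIENTOMETRICS',
              'JOURNAL OF INFORMETRICS', 'JOURNAL OF INFORMETRICS'],
    'year': [2013, 2016, 2013, 2016],
    'count': [10, 3, 5, 1],
})

cutoff_year = 2015  # hypothetical value, chosen only for this example

def journal_series(df, journal, before):
    """Rows for one journal, restricted to years strictly before `before`."""
    # Each condition needs its own parentheses because & binds tighter than == and <.
    mask = (df['entry'] == journal) & (df['year'] < before)
    return df[mask]

scientometrics_toy = journal_series(toy, 'SCIENTOMETRICS', cutoff_year)
print(scientometrics_toy)
```

Wrapping the filter in a function avoids repeating the same long expression once per journal.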

Click on the graph and hover over the lines to see the number of publications for each journal in each year. Note that although the first issue of Journal of the American Society for Information Science and Technology came out in 1950, the Web of Science does not have metadata on the journal before 2001.

Click on the graph to see it on plot.ly.

Next time, network analysis with metaknowledge!