Getting started with metaknowledge | NetLab

Hi! This post is a slightly cleaned-up version of the online supplement for the article I wrote on metaknowledge with Reid McIlroy-Young. The article, “Introducing metaknowledge: Software for Computational Research in Information Science, Science of Science, and Network Analysis”, includes an overview of our design philosophy as well as detailed examples of how to use metaknowledge. Unfortunately, the article is currently behind a paywall. The open access version is coming soon, and I’ll update this post with the link when it is available.

If you would prefer to work through this material in a Jupyter Notebook, you should download the original online supplement. It includes all the data from the article plus four Notebooks that do everything we do in the article.

If you want to learn more, here are some other posts in the metaknowledge tutorial series:

Network analysis with metaknowledge
Text analysis with metaknowledge
Historical bibliometrics with metaknowledge (with Reference Publication Year Spectroscopy)

Likely more to come later.

Getting started

Installing Python, the Scientific Stack, and metaknowledge

If you have not already done so, you will need to download and install the Anaconda distribution of Python 3, and the current public releases of metaknowledge and plotly. The other packages loaded below are included in the Anaconda distribution of Python 3.

Load packages

import metaknowledge as mk
import pandas

# for interactive graphs
# (in plotly 4 and later, plotly.plotly lives in the separate chart_studio package as chart_studio.plotly)
import plotly.plotly as py
import plotly.graph_objs as go

# for static graphs
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style(style="white") # change the default background plot colour
sns.set(font_scale=.75)
plt.rc("savefig", dpi=400) # improve default resolution of graphics

If this is your first time using plot.ly, you will need to sign up for a free account and get an API key. You can learn how to get started. Alternatively, you can use another package such as Bokeh for interactive graphs, or simply skip interactivity and stick to static graphs using a package like Seaborn.

The GitHub repository contains the data used in this post. If you are following along with this tutorial, make sure that your filepaths point to the raw_data directory, or whatever you have named the directory that contains your data.

Creating and processing record collections

Parsing raw data to create records and record collections

All we need to do to create a Record Collection is provide the file path to the raw data. We will use the information science and bibliometrics dataset used in the “Introducing metaknowledge” article, which we have stored in a directory called raw_data/imetrics/.

RC = mk.RecordCollection('raw_data/imetrics/', cached = True)
len(RC)

8140

We can easily write the full Record Collection to a .csv file using the writeCSV method. This file can be read by most other research software. Let’s save it in a directory called generated_datasets.

RC.writeCSV('generated_datasets/records.csv')
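To give a sense of what a flat export like this involves, here is a minimal sketch in plain Python. It is not metaknowledge's implementation: the two records are hand-copied from the tables later in this post, and the choice of "|" to join list-valued fields such as AF is an assumption made for illustration.

```python
import csv
import io

# Two records hand-copied from the tables in this post; keys are Web of Science tags.
records = [
    {"TI": "Hirsch-type indices for characterizing networks",
     "AF": ["Schubert, Andras", "Korn, Andras", "Telcs, Andras"],
     "PY": 2009},
    {"TI": "Theory and practise of the g-index",
     "AF": ["Egghe, Leo"],
     "PY": 2006},
]


def flatten(value, list_delimiter="|"):
    """Join list values into one string so each field fits in a single cell.

    The "|" delimiter is an assumption for this sketch.
    """
    if isinstance(value, list):
        return list_delimiter.join(str(item) for item in value)
    return value


def records_to_csv(records, fieldnames):
    """Return CSV text with a header row and one row per record."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for record in records:
        writer.writerow({name: flatten(record.get(name, "")) for name in fieldnames})
    return buffer.getvalue()


csv_text = records_to_csv(records, ["TI", "AF", "PY"])
print(csv_text)
```

The point of the sketch is that fields holding several values (for example, the list of authors) need to be collapsed into a single cell before they can be written as a row.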

Of course, it is also possible to continue working in Python. metaknowledge has some useful functions for working with Record Collections, but researchers can also use other Python packages such as Pandas.

The code block below uses the metaknowledge method yearSplit to extract the records published in 2013 and 2014 and shows the estimates for author gender. The process for estimating author genders uses birth record and name data. It is described in the article.

RC1314 = RC.yearSplit(2013, 2014)
gender_breakdown = RC1314.genderStats()
gender_breakdown

{'Female': 506, 'Male': 1349, 'Unknown': 1361}
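Raw counts like these are easier to interpret as shares. The short sketch below uses only the dictionary printed above; the variable names other than gender_breakdown are introduced here for illustration.

```python
# The counts reported above for records published in 2013 and 2014.
gender_breakdown = {'Female': 506, 'Male': 1349, 'Unknown': 1361}

# Share of each category among all author names.
total = sum(gender_breakdown.values())
shares = {label: count / total for label, count in gender_breakdown.items()}

# Share of women among the names that could be classified at all.
known = gender_breakdown['Female'] + gender_breakdown['Male']
female_share_of_known = gender_breakdown['Female'] / known

for label, share in shares.items():
    print(f"{label}: {share:.1%}")
print(f"Female share of classified authors: {female_share_of_known:.1%}")
```

Roughly 42% of the names fall in the Unknown category, so any conclusion drawn from the Female and Male counts applies only to the authors whose names could be classified.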

The glimpse method in metaknowledge is a convenient way to quickly view the most frequently occurring authors and journals, and the most highly cited articles. It will print a quick summary to screen.

print(RC.glimpse())

RecordCollection glimpse made at: 2017-08-29 09:36:01
8140 Records from files-from-raw_data/imetrics/
Top Authors
1 Bornmann, Lutz
2 Leydesdorff, Loet
3 Thelwall, Mike
4 Rousseau, Ronald
5 SCHUBERT, A
6 DAngelo, Ciriaco Andrea
6 Abramo, Giovanni
7 Glanzel, Wolfgang
8 Glanzel, W
9 Huang, Mu-Hsuan
10 BRAUN, T
10 Lariviere, Vincent
11 Waltman, Ludo
12 Ding, Ying
13 Daniel, Hans-Dieter
14 Cronin, Blaise
15 Rousseau, R
Top Journals
1 SCIENTOMETRICS
2 JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE AND TECHNOLOGY
3 JOURNAL OF INFORMETRICS
4 JOURNAL OF THE ASSOCIATION FOR INFORMATION SCIENCE AND TECHNOLOGY
Top Cited
1 Hirsch JE, 2005, P NATL ACAD SCI USA, V102, P16569, DOI 10.1073/pnas.0507655102
2 Egghe L, 2006, SCIENTOMETRICS, V69, P131, DOI 10.1007/s11192-006-0144-7
3 SMALL H, 1973, J AM SOC INFORM SCI, V24, P265, DOI 10.1002/asi.4630240406
4 GARFIELD E, 1972, SCIENCE, V178, P471, DOI 10.1126/science.178.4060.471
5 MERTON RK, 1968, SCIENCE, V159, P56, DOI 10.1126/science.159.3810.56
5 Glanzel W, 2001, SCIENTOMETRICS, V51, P69, DOI 10.1023/A:1010512628145
6 Leydesdorff L, 2011, J AM SOC INF SCI TEC, V62, P846, DOI 10.1002/asi.21509
6 PRICE DJD, 1965, SCIENCE, V149, P510
7 Katz JS, 1997, RES POLICY, V26, P1, DOI 10.1016/S0048-7333(96)00917-1
8 Glanzel W, 2006, SCIENTOMETRICS, V67, P315, DOI 10.1556/Scient.67.2006.2.12
8 Wasserman S, 1994, SOCIAL NETWORK ANAL
9 Leydesdorff L, 2009, J AM SOC INF SCI TEC, V60, P348, DOI 10.1002/asi.20967
9 SCHUBERT A, 1986, SCIENTOMETRICS, V9, P281, DOI 10.1007/BF02017249
10 Moed H. F., 2005, CITATION ANAL RES EV
11 White HD, 1998, J AM SOC INFORM SCI, V49, P327, DOI 10.1002/(SICI)1097-4571(19980401)49:43.0.CO;2-W
11 De Solla Price Derek J., 1963, LITTLE SCI BIG SCI
12 Glanzel W, 2003, SCIENTOMETRICS, V56, P357, DOI 10.1023/A:1022378804087

While glimpse is useful for getting a quick sense of the most frequently appearing authors and journals, and the most highly cited documents, most research workflows require direct interaction with the data stored in the Record Collection. The easiest way to do this is to convert the Record Collection into a Pandas dataframe. This provides access to a wide range of methods for selecting, filtering, grouping, summarizing, modeling, and plotting data.

In addition to the Pandas documentation, researchers may want to consult Wes McKinney’s book Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython or an online tutorial (e.g. Julia Evans’ Pandas cookbook).

df = pandas.DataFrame(RC.makeDict())
selectedVars = df[['AF', 'AB', 'PY', 'TI', 'SO', 'num-Authors', 'TC']]
selectedVars[:5] # show the first 5 rows

	AF	AB	PY	TI	SO	num-Authors	TC
0	[Bordons, M, Zulueta, MA, Romero, F, Barrigon, S]	A Multidisciplinary Research Programme (MRP) i...	1999	Measuring interdisciplinary collaboration with...	SCIENTOMETRICS	4	26
1	[Yan, Erjia, Guns, Raf]	This study examines collaboration dynamics wit...	2014	Predicting and recommending collaborations: An...	JOURNAL OF INFORMETRICS	2	7
2	[Davenport, Elisabeth]	None	2009	Everyday Information Practices: A Social Pheno...	JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATIO...	1	1
3	[Schubert, Andras, Korn, Andras, Telcs, Andras]	Hirsch-type indices are devised for characteri...	2009	Hirsch-type indices for characterizing networks	SCIENTOMETRICS	3	19
4	[Liang, LM, Kretschmer, H, Guo, YZ, Beaver, DD]	This paper is a scientometric study of the age...	2001	Age structures of scientific collaboration in...	SCIENTOMETRICS	4	20

The two-letter variable names are the tags used by Web of Science. A description of the content of each tag is available online. The code block above shows 7 tags that are typically of interest, but there are many others available.
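To make the idea of tags concrete, here is a small sketch of how tagged plain-text records can be turned into dictionaries. It assumes a simplified version of the Web of Science layout (a two-letter tag, a space, then the value; indented lines continue the previous tag; ER closes a record). It is not metaknowledge's parser, and the sample text is typed by hand from values shown in the tables in this post.

```python
# Hand-typed sample in a simplified Web of Science style (an assumption for this sketch).
SAMPLE = """\
AF Liben-Nowell, David
   Kleinberg, Jon
TI The link-prediction problem for social networks
PY 2007
TC 437
ER
AF Egghe, Leo
TI Theory and practise of the g-index
SO SCIENTOMETRICS
PY 2006
TC 538
ER
"""


def parse_tagged_records(text):
    """Parse tagged text into a list of {tag: [values]} dictionaries."""
    records, current, tag = [], {}, None
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        if line.strip() == "ER":
            # End of record: store it and start a new one.
            records.append(current)
            current, tag = {}, None
        elif line.startswith(" ") and tag is not None:
            # Indented line: another value for the most recent tag.
            current[tag].append(line.strip())
        else:
            # New tag: everything after the first space is the value.
            tag, _, value = line.partition(" ")
            current.setdefault(tag, []).append(value.strip())
    return records


parsed = parse_tagged_records(SAMPLE)
for record in parsed:
    print(record["TI"][0], "|", len(record["AF"]), "author(s) |", record["TC"][0], "citations")
```

Every value is kept as a list of strings here, which is why a tag like AF can hold several authors while PY holds a single entry.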

We can sort these dataframes by any quantitative variable. Below, we extract the 40 most highly cited articles in the dataset. Rather than print all 40, we will use .head() to print the top 5.

top_40 = selectedVars.sort_values(['TC'], ascending = False)[:40]
top_40.head()

	AF	AB	PY	TI	SO	num-Authors	TC
5760	[Egghe, Leo]	The g-index is introduced as an improvement of...	2006	Theory and practise of the g-index	SCIENTOMETRICS	1	538
7462	[Ho, YS]	This study presents a literature review concer...	2004	Citation review of Lagergren kinetic rate equa...	SCIENTOMETRICS	1	527
3788	[Liben-Nowell, David, Kleinberg, Jon]	Given a snapshot of a social network, can we i...	2007	The link-prediction problem for social networks	JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATIO...	2	437
1348	[Spink, A, Wolfram, D, Jansen, MBJ, Saracevic, T]	In studying actual Web searching by the public...	2001	Searching the Web: The public and their queries	JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATIO...	4	365
3088	[Meho, Lokman I., Yang, Kiduk]	The Institute for Scientific Information's (IS...	2007	Impact of data sources on citation counts and...	JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATIO...	2	315
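The sort-then-slice pattern used above does not depend on Pandas. As a rough illustration, the sketch below applies the same idea to the five rows shown in the first table of this post, using plain Python. The key name first_author and the function top_cited are made up for this example.

```python
from operator import itemgetter

# First author and times-cited (TC) values from the first five rows shown earlier.
# "first_author" is a hypothetical key introduced for this sketch.
rows = [
    {"first_author": "Bordons, M", "TC": 26},
    {"first_author": "Yan, Erjia", "TC": 7},
    {"first_author": "Davenport, Elisabeth", "TC": 1},
    {"first_author": "Schubert, Andras", "TC": 19},
    {"first_author": "Liang, LM", "TC": 20},
]


def top_cited(rows, n):
    """Return the n rows with the highest TC, most cited first."""
    return sorted(rows, key=itemgetter("TC"), reverse=True)[:n]


top_3 = top_cited(rows, 3)
for row in top_3:
    print(row["first_author"], row["TC"])
```

Sorting in descending order and then slicing is the whole technique; the dataframe version simply does it on a column.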

We can create an interactive barchart with plot.ly very easily from this dataframe. In general, barcharts are probably the least interesting graphs to make interactive, but in this case it is useful because hovering over the bars displays the full title information (click on the graph first).

trace = go.Bar(
    x=top_40['TI'],
    y=top_40['TC']
)
data = [trace]
layout = go.Layout(
    yaxis=dict(
        title='Times Cited',
    )
)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='times-cited')

Bar chart showing number of citations for each full title.

Click on the graph to see it on plot.ly.

If you are producing a static graph, it makes sense to shorten the titles. Otherwise the y-axis label would be as wide as the longest article title. We will create a new column called short_title that will contain the first 20 characters in the title. We can then use this new variable to create a horizontal bar graph with reasonable labels on the y-axis.

top_40['short_title'] = top_40['TI'].str[:20]
top_40[['TI', 'short_title', 'TC']].head()

	TI	short_title	TC
5760	Theory and practise of the g-index	Theory and practise	538
7462	Citation review of Lagergren kinetic rate equa...	Citation review of L	527
3788	The link-prediction problem for social networks	The link-prediction	437
1348	Searching the Web: The public and their queries	Searching the Web: T	365
3088	Impact of data sources on citation counts and...	Impact of data sourc	315

with sns.axes_style("white"):
    horizontal_bar = sns.barplot(data = top_40, x = 'TC', y = 'short_title', color = 'gray')
    horizontal_bar.set(xlabel='Number of Citations', ylabel='')
    sns.despine(left = True, right = True, bottom = True, top = True)
    plt.tight_layout()
    plt.savefig('figures/horizontal_barplot.png')

Bar chart showing number of citations for each short title.
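Cutting titles at exactly 20 characters can split a word in half (for example, "Searching the Web: T"). As a possible alternative, not the approach used in this post, the standard library's textwrap.shorten cuts between words and appends a placeholder. The helper name short_label and the last sample title are hypothetical.

```python
import textwrap

titles = [
    "Theory and practise of the g-index",
    "The link-prediction problem for social networks",
    "Searching the Web: The public and their queries",
    "Short title",  # hypothetical title that already fits
]


def short_label(title, width=20):
    """Shorten a title to at most `width` characters, cutting between words."""
    return textwrap.shorten(title, width=width, placeholder="...")


# Compare the hard 20-character cut with the word-aware version.
for title in titles:
    print(f"{title[:20]!r:24} {short_label(title)!r}")
```

The trade-off is that word-aware labels are often shorter than 20 characters, so two long titles that begin with the same words are more likely to end up with the same label.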

Creating time series datasets

It is also easy to plot time series graphs, for example of article publications over time. We can do this by using the timeSeries method in metaknowledge, and converting the results into a Pandas dataframe that can be easily plotted. We will write the time series dataset to a csv file at the same time.

# [2:] removes incomplete data from 2016
growth = pandas.DataFrame(RC.timeSeries('year', outputFile = 'generated_datasets/growth.csv'))[2:]
growth[:10]

	count	entry	year
2	643	2014	2014
3	572	2013	2013
4	545	2012	2012
5	494	2011	2011
6	535	2010	2010
7	467	2009	2009
8	389	2008	2008
9	395	2007	2007
10	353	2006	2006
11	267	2005	2005
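Before plotting, it can help to look at the year-over-year change directly. The sketch below uses only the ten counts printed in the table above; the variable names are introduced here for illustration.

```python
# Publication counts per year, copied from the table above.
counts_by_year = {
    2014: 643, 2013: 572, 2012: 545, 2011: 494, 2010: 535,
    2009: 467, 2008: 389, 2007: 395, 2006: 353, 2005: 267,
}

years = sorted(counts_by_year)

# Change in the number of publications relative to the previous year.
changes = {year: counts_by_year[year] - counts_by_year[year - 1] for year in years[1:]}

# Overall growth between the first and last year in the table.
growth_ratio = counts_by_year[years[-1]] / counts_by_year[years[0]]

for year in years[1:]:
    print(f"{year}: {changes[year]:+d}")
print(f"{years[0]} to {years[-1]}: x{growth_ratio:.2f}")
```

In these ten rows the count drops only twice (2008 and 2011), and the 2014 count is roughly 2.4 times the 2005 count.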

We can easily create an interactive line graph of this dataframe.

trace = go.Scatter(
    x = growth['year'],
    y = growth['count'],
    mode = 'lines+markers',
    name = 'lines+markers'
)
data = [trace]
layout = go.Layout(
    yaxis=dict(
        title='Number of Publications',
    ),
    xaxis=dict(
        title='Year',
    )
)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='growth-over-time')

Line chart of the number of publications over time.

Click on the graph to see it on plot.ly.

Of course, it would be a bit more interesting to compare the number of publications over time for each of the journals in the dataset, so let’s do that.

growth_by_journal = pandas.DataFrame(RC.timeSeries('journal', outputFile = 'generated_datasets/growth_journals.csv'))
scientometrics = growth_by_journal[(growth_by_journal.entry == 'SCIENTOMETRICS') & (growth_by_journal.year
informetrics = growth_by_journal[(growth_by_journal.entry == 'JOURNAL OF INFORMETRICS') & (growth_by_journal.year
jaist = growth_by_journal[(growth_by_journal.entry == 'JOURNAL OF THE ASSOCIATION FOR INFORMATION SCIENCE AND TECHNOLOGY') & (growth_by_journal.year
jasist = growth_by_journal[(growth_by_journal.entry == 'JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE AND TECHNOLOGY') & (growth_by_journal.year

trace1 = go.Scatter(
    x = scientometrics['year'],
    y = scientometrics['count'],
    mode = 'lines+markers',
    name = 'Scientometrics'
)
trace2 = go.Scatter(
    x = informetrics['year'],
    y = informetrics['count'],
    mode = 'lines+markers',
    name = 'Journal of Informetrics'
)
trace3 = go.Scatter(
    x = jaist['year'],
    y = jaist['count'],
    mode = 'lines+markers',
    name = 'JAIST (New Name)'
)
trace4 = go.Scatter(
    x = jasist['year'],
    y = jasist['count'],
    mode = 'lines+markers',
    name = 'JASIST (Old Name)'
)
data = [trace1, trace2, trace3, trace4]
layout = go.Layout(
    yaxis=dict(
        title='Number of Publications',
    ),
    xaxis=dict(
        title='Year',
    )
)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='growth-over-time-multiples')

Click on the graph and hover over the lines to see the number of publications for each journal in each year. Note that although the first issue of Journal of the American Society for Information Science and Technology came out in 1950, the Web of Science does not have metadata on the journal before 2001.

Line chart of the number of publications for each journal for each year.

Click on the graph to see it on plot.ly.
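The per-journal series plotted above come from counting records by journal and year and then selecting one journal at a time. Here is a minimal sketch of that idea in plain Python. The (journal, year) pairs are invented for illustration, and the cutoff year is a hypothetical value, not necessarily the one used for the graph.

```python
from collections import Counter

# Hypothetical (journal, year) pairs, one per record.
records = [
    ("SCIENTOMETRICS", 2013),
    ("SCIENTOMETRICS", 2013),
    ("SCIENTOMETRICS", 2014),
    ("JOURNAL OF INFORMETRICS", 2013),
    ("JOURNAL OF INFORMETRICS", 2016),
]

# Number of records for each (journal, year) combination.
counts = Counter(records)


def journal_series(counts, journal, before_year):
    """Return sorted (year, count) pairs for one journal, keeping years before `before_year`."""
    return sorted(
        (year, n)
        for (name, year), n in counts.items()
        if name == journal and year < before_year
    )


CUTOFF = 2015  # hypothetical cutoff chosen for this sketch
scientometrics_series = journal_series(counts, "SCIENTOMETRICS", CUTOFF)
informetrics_series = journal_series(counts, "JOURNAL OF INFORMETRICS", CUTOFF)
print(scientometrics_series)
print(informetrics_series)
```

Each resulting list of pairs corresponds to one line in a multi-line graph: the years go on the x-axis and the counts on the y-axis.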

Next time, network analysis with metaknowledge!