Hi! This post is a slightly cleaned-up version of the online supplement for the article I wrote on metaknowledge with Reid McIlroy-Young. You can read “Introducing metaknowledge: Software for Computational Research in Information Science, Science of Science, and Network Analysis”. It includes an overview of our design philosophy as well as detailed examples of how to use metaknowledge to do what you want it to do. Unfortunately, the article is currently behind a paywall. The open access version is coming soon! I’ll update this post with the link when it is available.
If you would prefer to work through this material in a Jupyter Notebook, you should download the original online supplement. It includes all the data from the article plus four Notebooks that do everything we do in the article.
If you want to learn more, here are some other posts in the metaknowledge tutorial series:
- Network analysis with metaknowledge
- Text analysis with metaknowledge
- Historical bibliometrics with metaknowledge (with Reference Publication Year Spectroscopy)
Likely more to come later.
Getting started
Installing Python, the Scientific Stack, and metaknowledge
If you have not already done so, you will need to download and install the Anaconda distribution of Python 3, and the current public releases of metaknowledge and plotly. The other packages loaded below are included in the Anaconda distribution of Python 3.
Load packages
If this is your first time using plot.ly, you will need to sign up for a free account and get an API key. You can learn how to get started. Alternatively, you can use another package such as Bokeh for interactive graphs, or simply skip interactivity and stick to static graphs using a package like Seaborn.
The GitHub repository contains the data used in this post. If you are following along with this tutorial, make sure that your filepaths point to the raw_data directory, or whatever you have named the directory that contains your data.
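A quick way to confirm that a filepath really points at your data is to list the directory before parsing anything. This is only a small sketch using the standard library; the helper name check_data_dir is made up for illustration and is not part of metaknowledge.

```python
from pathlib import Path


def check_data_dir(data_dir):
    """Return the visible file names in data_dir, or raise if it is missing."""
    path = Path(data_dir)
    if not path.is_dir():
        raise FileNotFoundError(f"No such data directory: {path.resolve()}")
    # Sort for a stable listing and skip hidden files such as .DS_Store.
    return sorted(
        p.name for p in path.iterdir() if p.is_file() and not p.name.startswith(".")
    )
```

Calling check_data_dir("raw_data/imetrics") should then return the names of the raw data files, or fail early with a readable error if the path is wrong.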
Creating and processing record collections
Parsing raw data to create records and record collections
All we need to do to create a Record Collection is provide the file path to the raw data. We will use the information science and bibliometrics dataset used in the “Introducing metaknowledge” article, which we have stored in a directory called raw_data/imetrics/.
We can easily write the full dataframe to a .csv file using the writeCSV method. This file can be used by any other research software. Let’s save it in a directory called generated_datasets.
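The two steps above, parsing the raw data and exporting it, might look roughly like the sketch below. The function name parse_and_export and its default paths are just illustrative choices based on the directory names used in this post; the only metaknowledge calls are RecordCollection and writeCSV.

```python
from pathlib import Path


def parse_and_export(data_dir="raw_data/imetrics/",
                     out_csv="generated_datasets/records.csv"):
    """Parse raw Web of Science files and write every record to one CSV file."""
    # Imported inside the function so the sketch can be defined even on a
    # machine where metaknowledge has not been installed yet.
    import metaknowledge as mk

    # Build the Record Collection from the directory of raw data.
    rc = mk.RecordCollection(data_dir)

    # Create the output directory first in case it does not exist.
    Path(out_csv).parent.mkdir(parents=True, exist_ok=True)
    rc.writeCSV(out_csv)
    return rc
```

Returning the Record Collection means the same object can be reused for the rest of the analysis without parsing the files again.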
Of course, it is also possible to continue working in Python. metaknowledge has some useful functions for working with Record Collections, but researchers can also use other Python packages such as Pandas.
The code block below uses the metaknowledge method yearSplit to extract the records published in 2013 and 2014 and shows the estimates for author gender. The process for estimating author genders uses birth record and name data. It is described in the article.
The glimpse method in metaknowledge is a convenient way to quickly view the most frequently occurring authors and journals, and the most highly cited articles. It will print a quick summary to the screen.
While glimpse is useful for getting a quick sense of the most frequently appearing authors and journals, and the most highly cited documents, most research workflows require direct interaction with the data stored in the Record Collection. The easiest way to do this is to convert the Record Collection into a Pandas dataframe. This provides access to a wide range of methods for selecting, filtering, grouping, summarizing, modeling, and plotting data.
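Here is a minimal sketch of that conversion. It assumes the Record Collection's makeDict method returns a dictionary of equal-length lists keyed by tag, which is what Pandas needs to build a dataframe. The helper name records_to_dataframe and the default column selection are illustrative, not something metaknowledge provides.

```python
import pandas as pd


def records_to_dataframe(rc, columns=("AF", "AB", "PY", "TI", "SO", "TC")):
    """Convert a Record Collection into a dataframe with a few chosen tags."""
    # Assumption: makeDict() gives {tag: [value for each record, ...]}.
    df = pd.DataFrame(rc.makeDict())
    # Keep only the requested columns that are actually present.
    keep = [c for c in columns if c in df.columns]
    return df[keep]
```

With a Record Collection called RC, records_to_dataframe(RC).head() would preview the first five rows.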
In addition to the Pandas documentation, researchers may want to consult Wes McKinney’s book Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython or an online tutorial (e.g. Julia Evans’ Pandas cookbook).
index | AF | AB | PY | TI | SO | num-Authors | TC |
---|---|---|---|---|---|---|---|
0 | [Bordons, M, Zulueta, MA, Romero, F, Barrigon, S] | A Multidisciplinary Research Programme (MRP) i... | 1999 | Measuring interdisciplinary collaboration with... | SCIENTOMETRICS | 4 | 26 |
1 | [Yan, Erjia, Guns, Raf] | This study examines collaboration dynamics wit... | 2014 | Predicting and recommending collaborations: An... | JOURNAL OF INFORMETRICS | 2 | 7 |
2 | [Davenport, Elisabeth] | None | 2009 | Everyday Information Practices: A Social Pheno... | JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATIO... | 1 | 1 |
3 | [Schubert, Andras, Korn, Andras, Telcs, Andras] | Hirsch-type indices are devised for characteri... | 2009 | Hirsch-type indices for characterizing networks | SCIENTOMETRICS | 3 | 19 |
4 | [Liang, LM, Kretschmer, H, Guo, YZ, Beaver, DD] | This paper is a scientometric study of the age... | 2001 | Age structures of scientific collaboration in... | SCIENTOMETRICS | 4 | 20 |
The two-letter variable names are the tags used by Web of Science. See the description of the content of each tag online. The code block above shows 7 tags that are typically of interest, but there are many others available.
We can sort these dataframes by any quantitative variable. Below, we extract the 40 most highly cited articles in the dataset. Rather than print all 40, we will use .head() to print the top 5.
index | AF | AB | PY | TI | SO | num_Authors | TC |
---|---|---|---|---|---|---|---|
5760 | [Egghe, Leo] | The g-index is introduced as an improvement of... | 2006 | Theory and practise of the g-index | SCIENTOMETRICS | 1 | 538 |
7462 | [Ho, YS] | This study presents a literature review concer... | 2004 | Citation review of Lagergren kinetic rate equa... | SCIENTOMETRICS | 1 | 527 |
3788 | [Liben-Nowell, David, Kleinberg, Jon] | Given a snapshot of a social network, can we i... | 2007 | The link-prediction problem for social networks | JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATIO... | 2 | 437 |
1348 | [Spink, A, Wolfram, D, Jansen, MBJ, Saracevic, T] | In studying actual Web searching by the public... | 2001 | Searching the Web: The public and their queries | JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATIO... | 4 | 365 |
3088 | [Meho, Lokman I., Yang, Kiduk] | The Institute for Scientific Information's (IS... | 2007 | Impact of data sources on citation counts and... | JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATIO... | 2 | 315 |
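The sorting step can be sketched on a handful of stand-in rows. The titles and citation counts below are copied from the tables in this post; with the real data, df would be the full dataframe built from the Record Collection. The variable name top_cited is my own choice here.

```python
import pandas as pd

# Stand-in rows copied from the tables in this post (title and times cited).
df = pd.DataFrame(
    {
        "TI": [
            "Hirsch-type indices for characterizing networks",
            "Searching the Web: The public and their queries",
            "Theory and practise of the g-index",
            "The link-prediction problem for social networks",
        ],
        "TC": [19, 365, 538, 437],
    }
)

# Sort by citation count, keep at most 40 rows, then preview the first five.
top_cited = df.sort_values("TC", ascending=False).head(40)
print(top_cited.head())
```

Note that sort_values keeps the original row labels, which is why the index in the table above is not 0, 1, 2, and so on.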
We can create an interactive barchart with plot.ly very easily from this dataframe. In general, barcharts are probably the least interesting graphs to make interactive, but in this case it is useful because hovering over the bars displays the full title information (click on the graph first).
Click on the graph to see it on plot.ly.
If you are producing a static graph, it makes sense to shorten the titles. Otherwise the y-axis label would be as wide as the longest article title. We will create a new column called short_title that will contain the first 20 characters in the title. We can then use this new variable to create a horizontal bar graph with reasonable labels on the y-axis.
index | TI | short_title | TC |
---|---|---|---|
5760 | Theory and practise of the g-index | Theory and practise | 538 |
7462 | Citation review of Lagergren kinetic rate equa... | Citation review of L | 527 |
3788 | The link-prediction problem for social networks | The link-prediction | 437 |
1348 | Searching the Web: The public and their queries | Searching the Web: T | 365 |
3088 | Impact of data sources on citation counts and... | Impact of data sourc | 315 |
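One way to build the short_title column is with Pandas string slicing. This sketch uses three stand-in rows taken from the table above, including their row labels; it may not match the original code exactly.

```python
import pandas as pd

# Three stand-in rows from the table above, with their original row labels.
top_cited = pd.DataFrame(
    {
        "TI": [
            "Theory and practise of the g-index",
            "The link-prediction problem for social networks",
            "Searching the Web: The public and their queries",
        ],
        "TC": [538, 437, 365],
    },
    index=[5760, 3788, 1348],
)

# .str[:20] slices every title down to its first 20 characters.
top_cited["short_title"] = top_cited["TI"].str[:20]
print(top_cited[["TI", "short_title", "TC"]])
```

The slice counts spaces as characters, so some short titles end in a trailing space that is invisible in the rendered table. If matplotlib is installed, something like top_cited.plot.barh(x="short_title", y="TC") should then give a static horizontal bar graph.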
Creating time series datasets
It is also easy to plot time series graphs, for example of article publications over time. We can do this by using the timeSeries method in metaknowledge, and converting the results into a Pandas dataframe that can be easily plotted. We will write the time series dataset to a .csv file at the same time.
index | count | entry | year |
---|---|---|---|
2 | 643 | 2014 | 2014 |
3 | 572 | 2013 | 2013 |
4 | 545 | 2012 | 2012 |
5 | 494 | 2011 | 2011 |
6 | 535 | 2010 | 2010 |
7 | 467 | 2009 | 2009 |
8 | 389 | 2008 | 2008 |
9 | 395 | 2007 | 2007 |
10 | 353 | 2006 | 2006 |
11 | 267 | 2005 | 2005 |
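A rough sketch of the time series step is below. It assumes that timeSeries accepts a tag name plus an outputFile argument and returns a dictionary of equal-length lists (the columns count, entry, and year in the table above suggest as much). Please check the metaknowledge documentation for the exact signature; the helper name yearly_counts is hypothetical.

```python
import pandas as pd


def yearly_counts(rc, out_csv=None):
    """Build a dataframe of publication counts per year from a Record Collection."""
    # Assumption: timeSeries("year") returns a dict of equal-length lists and
    # writes a CSV file as a side effect when outputFile is given.
    series = rc.timeSeries("year", outputFile=out_csv)
    return pd.DataFrame(series)
```

With a Record Collection called RC, yearly_counts(RC, "generated_datasets/growth.csv") would return the dataframe and save the same data to disk. The file name growth.csv is only an example.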
We can easily create an interactive line graph of this dataframe.
Click on the graph to see it on plot.ly.
Of course, it would be a bit more interesting to compare the number of publications over time for each of the journals in the dataset, so let’s do that.
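If you already have the records in a dataframe, one way to get per-journal counts is a Pandas groupby on the publication year (PY) and journal (SO) columns. The sketch below uses a few stand-in rows copied from the tables earlier in this post, so every count is 0 or 1. It is one possible approach, not necessarily how the original graph was produced.

```python
import pandas as pd

# Stand-in (journal, year) pairs copied from the tables earlier in this post.
df = pd.DataFrame(
    {
        "SO": [
            "SCIENTOMETRICS",
            "JOURNAL OF INFORMETRICS",
            "SCIENTOMETRICS",
            "SCIENTOMETRICS",
            "SCIENTOMETRICS",
            "SCIENTOMETRICS",
        ],
        "PY": [1999, 2014, 2009, 2001, 2006, 2004],
    }
)

# Count records per (year, journal) pair, then pivot journals into columns
# so that each journal becomes one line in a time series plot.
counts = df.groupby(["PY", "SO"]).size().unstack(fill_value=0)
print(counts)
```

Each column of counts is then a yearly series for one journal, ready to hand to whichever plotting package you prefer.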
Click on this graph and hover over the lines to see the number of publications for each journal in each year. Note that although the first issue of Journal of the American Society for Information Science and Technology came out in 1950, the Web of Science does not have metadata on the journal before 2001.
Click on the graph to see it on plot.ly.
Next time, network analysis with metaknowledge!