This post on generating and analyzing scientific networks with metaknowledge builds on a previous post about the RecordCollection object. Like last time, this post is a slightly cleaned-up version of the online supplements for the article I wrote on metaknowledge with Reid McIlroy-Young. You can read “Introducing metaknowledge: Software for Computational Research in Information Science, Science of Science, and Network Analysis”. Open access version coming soon.
If you would prefer to work through this material in a Jupyter Notebook, you should download the original online supplement directory. It includes all the data from the article plus four notebooks that follow along with the article. The notebooks were prepared by me, Reid McIlroy-Young (my former student, now a graduate student at University of Chicago), and Jillian Anderson (my former student, now an MSc student at Simon Fraser University).1
We are going to produce quite a few visualizations of networks in this post. While I love network visualizations, you generally shouldn’t rely on them for any serious bibliometric or social science research purposes, especially when you have large networks like the ones we are dealing with here. I’m producing lots of visualizations to show you how you can get started working with the network objects that metaknowledge produces. What you do with those network objects in your own research is, obviously, entirely up to you.
The second reason why I am producing lots of network visualizations in this post is to demonstrate that it is possible to produce slightly-better-than-hideous visualizations of networks in Python. Even for large networks, like these ones. But if you are really dependent on visualizations for one reason or another, you should probably write your networks to disk and read them into software like Gephi or Visone.
Let’s get started. We will use the same information science dataset used in the last metaknowledge post.
Read data, create a RecordCollection, and generate a network
Like last time, we will begin by producing a RecordCollection object.
Now we can easily use any of the network generator methods to get a network. Let’s use the .networkCoAuthor() method to get a co-authorship network, and then use the graphStats() function to get an idea of the size of the network.
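A minimal sketch of these two steps. The records path is a placeholder, the function name `build_coauthor_network` is made up for this example, and the sketch assumes that `graphStats` is available at the top level of the package (`mk.graphStats`). Check the documentation for your installed version of metaknowledge if any of these calls fail.

```python
def build_coauthor_network(records_path):
    """Read bibliographic records and return (RecordCollection, co-authorship graph)."""
    # Imported inside the function so the sketch can be loaded
    # even on a machine where metaknowledge is not installed yet.
    import metaknowledge as mk

    # records_path is a placeholder: a file or directory of exported records.
    RC = mk.RecordCollection(records_path)

    # The network generators return ordinary networkx graphs.
    coauth_net = RC.networkCoAuthor()

    # Assumed top-level helper that summarizes nodes, edges, density, and so on.
    print(mk.graphStats(coauth_net))
    return RC, coauth_net
```

Calling `RC, coauth_net = build_coauthor_network("path/to/records/")` would then give you both objects to work with in the rest of the post.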
Once the network has been generated, you typically want to process it a bit. With metaknowledge, you can modify the network that is already loaded in your computer’s memory rather than creating a copy. In this case, we will drop any edges with a weight less than 2, then drop self-loops, and finally extract the giant component.
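Here is a rough sketch of those three steps on a toy graph. It uses only plain networkx calls rather than metaknowledge’s own in-place filtering helpers, so treat it as an equivalent illustration, not as the code behind the results in this post. The `prune_network` name and the toy edge list are made up.

```python
import networkx as nx


def prune_network(G, min_weight=2):
    """Drop light edges and self-loops in place, then return the giant component."""
    # Collect edges first: removing edges while iterating over them is unsafe.
    light = [(u, v) for u, v, w in G.edges(data="weight", default=1) if w < min_weight]
    G.remove_edges_from(light)

    # Self-loops are edges from a node to itself.
    G.remove_edges_from(list(nx.selfloop_edges(G)))

    # The giant component is the largest set of mutually reachable nodes.
    giant_nodes = max(nx.connected_components(G), key=len)
    return G.subgraph(giant_nodes).copy()


# Toy weighted co-authorship graph, for illustration only.
toy = nx.Graph()
toy.add_weighted_edges_from(
    [("A", "B", 3), ("B", "C", 2), ("C", "D", 1), ("D", "E", 5), ("A", "A", 4)]
)
giant = prune_network(toy)
print(sorted(giant.nodes()), giant.number_of_edges())
```

Note that the first two steps change the graph you pass in, while the last step returns a new, smaller graph.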
We are left with a much smaller network! If this were a real research project, you would likely make different decisions about what to drop.
We can use any networkx methods to analyze our network, such as computing centralities and global network properties. For example, let’s compute some common centrality measures and store them in a Pandas dataframe. Once we have the dataframe, we can sort the rows using the .sort_values() method from Pandas. Instead of printing the entire dataframe (there are still 265 rows), I’ll just show the top 10.
author | degree | eigenvector | betweenness | closeness |
---|---|---|---|---|
Thelwall, Mike | 0.098485 | 6.875438e-04 | 0.217349 | 0.241758 |
Rousseau, Ronald | 0.079545 | 2.021960e-03 | 0.191199 | 0.233216 |
Leydesdorff, Loet | 0.079545 | 3.707337e-01 | 0.755110 | 0.314660 |
Ding, Ying | 0.071970 | 1.523765e-03 | 0.168347 | 0.238914 |
Lariviere, Vincent | 0.068182 | 1.008884e-03 | 0.108039 | 0.239782 |
Glanzel, Wolfgang | 0.056818 | 4.564843e-03 | 0.158634 | 0.232190 |
Sugimoto, Cassidy R. | 0.053030 | 1.258804e-03 | 0.077937 | 0.245125 |
Bornmann, Lutz | 0.045455 | 6.510081e-01 | 0.041503 | 0.251908 |
de Moya-Anegon, Felix | 0.041667 | 1.356966e-02 | 0.103560 | 0.249292 |
Wang, Xianwen | 0.041667 | 6.978493e-07 | 0.074000 | 0.149153 |
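A table like the one above can be built along these lines. This is a sketch under a few assumptions: it uses the karate club graph that ships with networkx as a stand-in for the co-authorship network, it sorts by degree (which is how the table appears to be ordered), and the `centrality_table` name is made up.

```python
import networkx as nx
import pandas as pd


def centrality_table(G):
    """Return a dataframe with one row per node and one column per centrality."""
    # Each networkx function returns a {node: score} dict; pandas aligns
    # the dicts on their keys, so nodes become the row index.
    return pd.DataFrame(
        {
            "degree": nx.degree_centrality(G),
            "eigenvector": nx.eigenvector_centrality(G, max_iter=1000),
            "betweenness": nx.betweenness_centrality(G),
            "closeness": nx.closeness_centrality(G),
        }
    )


# Stand-in graph; with metaknowledge you would pass your own network here.
G = nx.karate_club_graph()
top10 = centrality_table(G).sort_values("degree", ascending=False).head(10)
print(top10)
```

Sorting by a different column is just a matter of changing the first argument to `.sort_values()`.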
We can produce bar graphs of our centrality scores to facilitate comparisons. For example, let’s look at the top 100 degree scores. We will use plot.ly again, so that our graphs are interactive. Interactivity makes it easier to read the names on the x-axis. (Although, we could just produce a horizontal bar graph instead.)
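A sketch of how such a bar graph could be put together. It assumes the current `plotly.graph_objects` interface rather than whatever plot.ly interface was used when the original figures were made, and the function names and toy scores are made up for illustration.

```python
def top_n(scores, n=100):
    """Return the n highest-scoring (name, score) pairs, largest first."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:n]


def bar_chart(scores, n=100, title=""):
    """Build an interactive bar chart of the top n scores."""
    # Third-party dependency, imported here so top_n() works without it.
    import plotly.graph_objects as go

    names, values = zip(*top_n(scores, n))
    fig = go.Figure(go.Bar(x=list(names), y=list(values)))
    # Angled tick labels make long author names easier to read.
    fig.update_layout(title=title, xaxis_tickangle=-45)
    return fig


# Toy degree centrality scores, made up for illustration.
toy_scores = {"A": 0.10, "B": 0.35, "C": 0.20, "D": 0.05}
print(top_n(toy_scores, n=2))
```

With a real network you could pass `nx.degree_centrality(G)` straight in as `scores` and call `.show()` on the returned figure.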
Click on the graph to see it on plot.ly.
Next, let’s look at the top 100 betweenness centrality scores.
Click on the graph to see it on plot.ly.
And the top 100 eigenvector centrality scores.
Click on the graph to see it on plot.ly.
Of course we are not limited to bar charts for exploring centrality. We can also produce simple scatterplots.2
Click on the graph to see it on plot.ly.
Sometimes you need a static plot for a journal article. Within the Python ecosystem, I prefer Seaborn. Producing a scatterplot is straightforward.
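A minimal sketch of a static scatterplot. It assumes a dataframe with one column per centrality measure; the choice of degree against betweenness, the output file name, and the function name are all placeholders.

```python
def scatter_centralities(df, x="degree", y="betweenness", out_path="scatter.pdf"):
    """Save a static scatterplot of two centrality columns and return the path."""
    # Third-party plotting libraries, imported where they are used.
    import matplotlib.pyplot as plt
    import seaborn as sns

    fig, ax = plt.subplots(figsize=(6, 6))
    sns.scatterplot(data=df, x=x, y=y, ax=ax)
    ax.set_xlabel(f"{x} centrality")
    ax.set_ylabel(f"{y} centrality")

    # A vector format such as PDF scales cleanly in a journal article.
    fig.savefig(out_path, bbox_inches="tight")
    plt.close(fig)
    return out_path
```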
Download a higher resolution of the scatterplot graph in Seaborn (.pdf)
Visualizing networks
Of course, we can also visualize the networks themselves. Networkx does an adequate job of this, provided the networks are not too large.
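One way to draw a network with networkx and matplotlib is sketched below. The sizing rule, layout seed, figure size, and file name are arbitrary choices for illustration, not the settings used for the figures in this post.

```python
import networkx as nx


def node_sizes(G, base=20, scale=30):
    """Size each node by its degree so that hubs stand out."""
    return [base + scale * degree for _, degree in G.degree()]


def draw_network(G, out_path="network.pdf", seed=42):
    """Draw G with a force-directed layout and save it to out_path."""
    import matplotlib.pyplot as plt  # imported where it is used

    # A fixed seed makes the layout reproducible between runs.
    pos = nx.spring_layout(G, seed=seed)
    fig, ax = plt.subplots(figsize=(10, 10))
    nx.draw_networkx_nodes(G, pos, node_size=node_sizes(G), ax=ax)
    nx.draw_networkx_edges(G, pos, alpha=0.3, ax=ax)
    ax.set_axis_off()
    fig.savefig(out_path, bbox_inches="tight")
    plt.close(fig)


# Toy graph: a path of four nodes, so the two middle nodes have degree 2.
toy = nx.path_graph(4)
print(node_sizes(toy))
```

Leaving out node labels, as this sketch does, usually helps with networks of this size.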
Download a higher resolution of the co-authors network graph (.pdf)
Community detection
If you install the python-louvain package (with pip3 install python-louvain), you can perform Louvain community detection3 on your networks quite easily. Once installed on your machine, you load the package with import community.
Once we have computed the modularity and partitioned the network, we can color the nodes based on their community membership. I’m using the “Set2” color palette.
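The partition and coloring steps might look roughly like this. The sketch relies on the `best_partition` and `modularity` functions from python-louvain and spells out the ColorBrewer “Set2” palette as hex codes. The helper names and the hand-written toy partition are made up.

```python
# The eight colors of the ColorBrewer "Set2" palette.
SET2 = ["#66c2a5", "#fc8d62", "#8da0cb", "#e78ac3",
        "#a6d854", "#ffd92f", "#e5c494", "#b3b3b3"]


def detect_communities(G):
    """Return ({node: community id}, modularity) using Louvain."""
    # python-louvain installs itself under the module name "community".
    import community as community_louvain

    partition = community_louvain.best_partition(G)
    modularity = community_louvain.modularity(partition, G)
    return partition, modularity


def community_colors(nodes, partition, palette=SET2):
    """Return one color per node, cycling through the palette if needed."""
    return [palette[partition[node] % len(palette)] for node in nodes]


# Hand-written toy partition, standing in for the output of best_partition().
toy_partition = {"A": 0, "B": 0, "C": 1}
print(community_colors(["A", "B", "C"], toy_partition))
```

The resulting list can be passed as `node_color` to the networkx drawing functions, as long as it is built from `G.nodes()` so the order matches.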
Download a higher resolution of the co-authors community network graph (.pdf)
Co-citation networks
So far we have been looking at an example of a co-authorship network. Let’s dig into a few other network generators, starting with journal-level co-citation. In the example below we will set coreOnly = True, which means that metaknowledge will only add nodes to the co-citation network if the document in question was part of the original RecordCollection object. If you set it to False (which is the default), then metaknowledge will produce a much larger co-citation network that includes every item that appears in every bibliography for all of the Records in your RecordCollection.
In addition, we are going to remove all edges that don’t meet a given threshold: 3. In a co-citation network, you can (and should) get rid of a lot of noise by simply ignoring co-citations that only happen once or twice.
Let’s just focus on the giant component.
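A sketch of the whole co-citation workflow, under two assumptions: that `nodeType="journal"` is the right way to ask `networkCoCitation()` for journal-level nodes, and that edge weights are stored under the `weight` attribute. The helper names and toy graph are made up.

```python
import networkx as nx


def strong_giant_component(G, min_weight=3):
    """Drop edges lighter than min_weight, then return the giant component."""
    weak = [(u, v) for u, v, w in G.edges(data="weight", default=1) if w < min_weight]
    G.remove_edges_from(weak)
    giant_nodes = max(nx.connected_components(G), key=len)
    return G.subgraph(giant_nodes).copy()


def journal_cocitation(RC, min_weight=3):
    """Build a core-only journal co-citation network from a RecordCollection."""
    # nodeType="journal" is an assumption; coreOnly=True is described in the text.
    G = RC.networkCoCitation(nodeType="journal", coreOnly=True)
    return strong_giant_component(G, min_weight)


# Toy co-citation graph: weights are co-citation counts between journals.
toy = nx.Graph()
toy.add_weighted_edges_from(
    [("J1", "J2", 5), ("J2", "J3", 3), ("J3", "J4", 1), ("J4", "J5", 2)]
)
giant = strong_giant_component(toy)
print(sorted(giant.nodes()))
```

Raising `min_weight` gives a sparser, more readable network at the cost of dropping weaker relationships.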
Download a higher resolution of the co-citation network graph (.pdf)
We can also run community detection algorithms on co-citation networks.
Download a higher resolution of the co-citation networks community network graph (.pdf)
Co-investigator networks
metaknowledge also simplifies the process of generating collaboration networks for co-investigators on grants. This time we are using .networkCoInvestigator(), which is a method of the GrantCollection object (not the RecordCollection object).
We are going to switch up the dataset for this network. We will work with the NSERC (Natural Sciences and Engineering Research Council of Canada) data in the ‘raw_data’ directory of the Online Supplement GitHub repository for the metaknowledge article.
For this example, we can restrict this network to recurring collaborations within the giant component. This makes plotting easier because it makes the network a lot smaller. For that same reason, you would likely make a different decision if this was a real research project.
Download a higher resolution of the network graph of co-investigator networks (.pdf)
As before, we can easily identify researchers with the highest centrality scores. Here are the researchers with the top 10 betweenness.
researcher | betweenness |
---|---|
Mi, Zetian | 0.555067 |
Farnood, Ramin | 0.423123 |
Botton, Gianluigi | 0.403399 |
Kortschot, Mark | 0.345576 |
Sain, Mohini | 0.341042 |
Kherani, Nazir | 0.293202 |
Wilkinson, David | 0.283616 |
Ruda, Harry | 0.256005 |
VandeVen, Theodorus | 0.238875 |
Hill, Reghan | 0.238875 |
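The co-investigator network and a ranking like the one above can be sketched as follows. The grants path is a placeholder, the function names are made up, and the toy star graph only stands in for the real network.

```python
import networkx as nx


def load_coinvestigator_network(grants_path):
    """Read grant records and return the co-investigator network."""
    # Imported here so the ranking helper below works without metaknowledge.
    import metaknowledge as mk

    # grants_path is a placeholder for a directory of grant data files.
    GC = mk.GrantCollection(grants_path)
    return GC.networkCoInvestigator()


def top_betweenness(G, n=10):
    """Return the n nodes with the highest betweenness centrality."""
    scores = nx.betweenness_centrality(G)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:n]


# Toy graph: node 0 is a hub connected to three otherwise unconnected nodes,
# so every shortest path between the other nodes runs through it.
toy = nx.star_graph(3)
print(top_betweenness(toy, n=1))
```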
Or we could look at a bar graph of the 100 researchers with the highest betweenness centrality scores.
Click on the graph to see it on plot.ly.
Institution-level co-investigator networks
All of this can be done at the institution level as well.
Institution | Degree Centrality |
---|---|
University of British Columbia | 0.156159 |
University of Toronto | 0.148688 |
University of Waterloo | 0.140306 |
University of Alberta | 0.129009 |
McGill University | 0.120445 |
Université Laval | 0.107507 |
École Polytechnique de Montréal | 0.092748 |
University of Calgary | 0.090561 |
University of Ottawa | 0.085459 |
Queen's University | 0.083273 |
McMaster University | 0.074708 |
University of Guelph | 0.069606 |
Carleton University | 0.067238 |
University of Western Ontario | 0.066509 |
Université de Sherbrooke | 0.063411 |
Click on the graph to see it on plot.ly.
Institution | Eigenvector Centrality |
---|---|
University of British Columbia | 0.156159 |
University of Toronto | 0.148688 |
University of Waterloo | 0.140306 |
University of Alberta | 0.129009 |
McGill University | 0.120445 |
Université Laval | 0.107507 |
École Polytechnique de Montréal | 0.092748 |
University of Calgary | 0.090561 |
University of Ottawa | 0.085459 |
Queen's University | 0.083273 |
McMaster University | 0.074708 |
University of Guelph | 0.069606 |
Carleton University | 0.067238 |
University of Western Ontario | 0.066509 |
Université de Sherbrooke | 0.063411 |
Wrapping up: Writing networks to disk
The Jupyter Notebooks on GitHub have more examples, including keyword co-occurrence networks, two-mode networks, and multi-mode networks. I’ll work some of that material into a blog post sometime, but if you are interested in seeing it now, all of the code is already available over there.
One of the central design goals with metaknowledge was to make integrations with other software seamless. I was especially obsessed with making it as easy as possible to go between metaknowledge and the statnet suite of R libraries, which have no equals in the Python world. As such, it’s really easy to write your networks to disk and open them up using other software. The main function in metaknowledge for writing to disk is writeGraph(), which produces a weighted edge list and a node attribute file, both in csv format.
Of course, because networkx functions work on metaknowledge networks, you can use any of the networkx writers as well. For example, you could write your network to GraphML.
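Both options are sketched below. The `writeGraph()` call assumes its second argument is a file name stem for the two output files, and the wrapper name, toy graph, and file names are made up. The GraphML part uses the standard networkx writer and reads the file back in to check the round trip.

```python
import os
import tempfile

import networkx as nx


def write_for_other_tools(G, name):
    """Write an edge list and a node attribute file with metaknowledge."""
    # Imported here so the GraphML example below runs without metaknowledge.
    import metaknowledge as mk

    # Assumption: `name` is used as the stem of the two output files.
    mk.writeGraph(G, name)


# Toy weighted network, for illustration only.
toy = nx.Graph()
toy.add_edge("Author A", "Author B", weight=3)

# Write to a temporary directory so the example leaves your project untouched.
out_dir = tempfile.mkdtemp()
graphml_path = os.path.join(out_dir, "coauthors.graphml")
nx.write_graphml(toy, graphml_path)

# Reading the file back confirms that the edge weight survived.
round_trip = nx.read_graphml(graphml_path)
print(round_trip["Author A"]["Author B"]["weight"])
```

GraphML files open directly in Gephi and Visone, which makes this a convenient hand-off point for visualization.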
- Reid and Jillian are both recent graduates of NetLab at the University of Waterloo.
- With a bit of extra code you can also make plot.ly display the node names when you hover over the points in the scatterplot. To keep things relatively simple, I won’t get into that here.
- See Blondel et al. (2008), “Fast unfolding of communities in large networks”. An arXiv pre-print of the paper is also available (.pdf).