Network analysis with metaknowledge

This post on generating and analyzing scientific networks with metaknowledge builds on a previous post about the RecordCollection object. Like last time, this post is a slightly cleaned up version of the online supplements for the article I wrote on metaknowledge with Reid McIlroy-Young. You can read “Introducing metaknowledge: Software for Computational Research in Information Science, Science of Science, and Network Analysis”. Open access version coming soon.

If you would prefer to work through this material in a Jupyter Notebook, you should download the original online supplement directory. It includes all the data from the article plus four notebooks that follow along with the article. The notebooks were prepared by me, Reid McIlroy-Young (my former student, now a graduate student at University of Chicago), and Jillian Anderson (my former student, now an MSc student at Simon Fraser University).¹

We are going to produce quite a few visualizations of networks in this post. While I love network visualizations, you generally shouldn’t rely on them for any serious bibliometric or social science research purposes, especially when you have large networks like the ones we are dealing with here. I’m producing lots of visualizations to show you how you can get started working with the network objects that metaknowledge produces. What you do with those network objects in your own research is, obviously, entirely up to you.

The second reason why I am producing lots of network visualizations in this post is to demonstrate that it is possible to produce slightly-better-than-hideous visualizations of networks in Python. Even for large networks, like these ones. But if you are really dependent on visualizations for one reason or another, you should probably write your networks to disk and read them into software like Gephi or Visone.

Let’s get started. We will use the same information science dataset used in the last metaknowledge post.

    import metaknowledge as mk
    import matplotlib.pyplot as plt
    import seaborn as sns
    import networkx as nx
    import community # install python-louvain for community detection
    import pandas
    # for interactive graphs
    import plotly.plotly as py
    import plotly.graph_objs as go

    plt.rc("savefig", dpi=600)
    sns.set(font_scale=.75)

Read data, create a `RecordCollection`, and generate a network

Like last time, we will begin by producing a RecordCollection object.

    RC = mk.RecordCollection('raw_data/imetrics/', cached = True)
    RC1014 = RC.yearSplit(2010, 2014)

Now we can easily use any of the network generator methods to get a network. Let’s use the .networkCoAuthor() method to get a co-authorship network, and then use the graphStats() function to get an idea of the size of the network.

    coauth_net = RC.networkCoAuthor()
    print(mk.graphStats(coauth_net))

    Nodes: 10104
    Edges: 15507
    Isolates: 1111
    Self loops: 0
    Density: 0.000303818
    Transitivity: 0.555409
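As a quick sanity check on the graphStats() output, the density of an undirected network without self-loops is the number of observed edges divided by the number of possible edges, 2E / (N(N - 1)). Here is a minimal sketch in plain Python that plugs in the node and edge counts reported above. The function name is purely illustrative and is not part of metaknowledge.

```python
def undirected_density(n_nodes, n_edges):
    """Density of a simple undirected graph: observed edges / possible edges."""
    possible_edges = n_nodes * (n_nodes - 1) / 2
    return n_edges / possible_edges

# Node and edge counts copied from the graphStats() output for the co-authorship network
density = undirected_density(10104, 15507)
print(f"Density: {density:.9f}")
```

The result should agree with the Density line that graphStats() reports, which is a handy way to confirm that you are reading the summary correctly.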

Once the network has been generated, you typically want to process it a bit. With metaknowledge, you can modify the network that is already loaded in your computer’s memory rather than creating a copy. In this case, we will drop any edges with a weight less than 2, then drop self-loops, and finally extract the giant component.

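    mk.dropEdges(coauth_net, minWeight = 2, dropSelfLoops = True)
    # nx.connected_component_subgraphs() was removed in networkx 2.4,
    # so take the largest connected component and build the subgraph explicitly
    giant_coauth = coauth_net.subgraph(max(nx.connected_components(coauth_net), key=len)).copy()
    print(mk.graphStats(giant_coauth))

    Nodes: 265
    Edges: 443
    Isolates: 0
    Self loops: 0
    Density: 0.0126644
    Transitivity: 0.285714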
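If you want to see what this kind of pruning does without loading the full dataset, the same steps can be sketched on a toy graph using networkx alone. This is only an illustration of the idea, not metaknowledge's implementation: the helper names are hypothetical, and I am assuming that edge weights are stored in an edge attribute called 'weight'.

```python
import networkx as nx

def drop_light_edges(graph, min_weight):
    """Remove, in place, self-loops and edges whose 'weight' is below min_weight."""
    to_drop = [
        (u, v)
        for u, v, w in graph.edges(data='weight', default=1)
        if w < min_weight or u == v
    ]
    graph.remove_edges_from(to_drop)

def giant_component(graph):
    """Return a copy of the largest connected component of an undirected graph."""
    largest = max(nx.connected_components(graph), key=len)
    return graph.subgraph(largest).copy()

# Toy co-authorship network: (author, author, number of joint papers)
toy = nx.Graph()
toy.add_weighted_edges_from([
    ('A', 'B', 3), ('B', 'C', 2), ('C', 'D', 1), ('E', 'F', 5), ('A', 'A', 4),
])

drop_light_edges(toy, min_weight=2)
giant = giant_component(toy)
print(sorted(giant.nodes()), giant.number_of_edges())
```

Note that dropping edges leaves the nodes in place, so author D becomes an isolate rather than disappearing. That is why extracting the giant component is a separate step.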

We are left with a much smaller network! If this were a research project, you would probably want to make different decisions about how aggressively to prune the network.

We can use any networkx methods to analyze our network, such as computing centralities and global network properties. For example, let’s compute some common centrality measures and store them in a Pandas dataframe. Once we have the dataframe, we can sort the rows using the .sort_values() method from Pandas. Instead of printing the entire dataframe (there are still 265 rows), I’ll just show the top 10.

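    deg = nx.degree_centrality(giant_coauth)
    eig = nx.eigenvector_centrality(giant_coauth)
    # betweenness and closeness are also needed for the table and plots that follow
    bet = nx.betweenness_centrality(giant_coauth)
    clo = nx.closeness_centrality(giant_coauth)

    cent_df = pandas.DataFrame.from_dict([deg, eig, bet, clo])
    cent_df = pandas.DataFrame.transpose(cent_df)
    cent_df.columns = ['degree', 'eigenvector', 'betweenness', 'closeness']
    cent_df.sort_values('degree', ascending = False)[:10]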

	degree	eigenvector	betweenness	closeness
Thelwall, Mike	0.098485	6.875438e-04	0.217349	0.241758
Rousseau, Ronald	0.079545	2.021960e-03	0.191199	0.233216
Leydesdorff, Loet	0.079545	3.707337e-01	0.755110	0.314660
Ding, Ying	0.071970	1.523765e-03	0.168347	0.238914
Lariviere, Vincent	0.068182	1.008884e-03	0.108039	0.239782
Glanzel, Wolfgang	0.056818	4.564843e-03	0.158634	0.232190
Sugimoto, Cassidy R.	0.053030	1.258804e-03	0.077937	0.245125
Bornmann, Lutz	0.045455	6.510081e-01	0.041503	0.251908
de Moya-Anegon, Felix	0.041667	1.356966e-02	0.103560	0.249292
Wang, Xianwen	0.041667	6.978493e-07	0.074000	0.149153
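The degree scores in this table are normalized: networkx divides each node's degree by n - 1, the largest degree a node could have in a network of n nodes. If you would rather think in terms of raw counts, you can multiply back. A small sketch in plain Python, using three rows copied from the table above and the 265 nodes in the giant component:

```python
n_nodes = 265  # size of the giant component reported by graphStats()

# Normalized degree centrality scores copied from the table above
degree_centrality = {
    'Thelwall, Mike': 0.098485,
    'Rousseau, Ronald': 0.079545,
    'Wang, Xianwen': 0.041667,
}

# Undo the normalization to recover the number of distinct co-authors
# (within the pruned network, so only recurring collaborations count)
coauthor_counts = {
    name: round(score * (n_nodes - 1))
    for name, score in degree_centrality.items()
}

for name, count in coauthor_counts.items():
    print(name, count)
```

Keep in mind that these counts describe the pruned network, not each author's full publication record.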

We can produce bar graphs of our centrality scores to facilitate comparisons. For example, let’s look at the top 100 degree scores. We will use plot.ly again, so that our graphs are interactive. Interactivity makes it easier to read the names on the x-axis. (Although, we could just produce a horizontal bar graph instead.)

    cent_df_d100 = cent_df.sort_values('degree', ascending = False)[:100]

    trace = go.Bar(
        x = cent_df_d100.index,
        y = cent_df_d100['degree']
    )
    data = [trace]
    layout = go.Layout(
        yaxis=dict(
            title='Degree Centrality',
        )
    )
    fig = go.Figure(data=data, layout=layout)
    py.iplot(fig, filename='cent-dist')

Bar graph of the top 100 degree centrality scores.

Next, let’s look at the top 100 betweenness centrality scores.

    cent_df_b100 = cent_df.sort_values('betweenness', ascending = False)[:100]

    trace = go.Bar(
        x = cent_df_b100.index,
        y = cent_df_b100['betweenness']
    )
    data = [trace]
    layout = go.Layout(
        yaxis=dict(
            title='Betweenness Centrality',
        )
    )
    fig = go.Figure(data=data, layout=layout)
    py.iplot(fig, filename='cent-dist-b')

Bar graph of the top 100 betweenness centrality scores.

And the top 100 eigenvector centrality scores.

    cent_df_e100 = cent_df.sort_values('eigenvector', ascending = False)[:100]

    trace = go.Bar(
        x = cent_df_e100.index,
        y = cent_df_e100['eigenvector']
    )
    data = [trace]
    layout = go.Layout(
        yaxis=dict(
            title='Eigenvector Centrality',
        )
    )
    fig = go.Figure(data=data, layout=layout)
    py.iplot(fig, filename='cent-dist-e')

Bar graph of the top 100 eigenvector centrality scores.

Of course we are not limited to bar charts for exploring centrality. We can also produce simple scatterplots.²

    trace = go.Scatter(
        x = cent_df['degree'],
        y = cent_df['betweenness'],
        mode = 'markers'
    )
    data = [trace]
    layout = go.Layout(
        xaxis=dict(
            title='Degree Centrality',
        ),
        yaxis=dict(
            title='Betweenness Centrality',
        )
    )
    fig = go.Figure(data=data, layout=layout)
    py.iplot(fig, filename='centralities-scatter')

Scatterplot of degree centrality against betweenness centrality.

Sometimes you need a static plot for a journal article. Within the Python ecosystem, I prefer Seaborn. Producing a scatterplot is straightforward.

    with sns.axes_style('white'):
        sns.jointplot(x='degree', y='eigenvector', data=cent_df, xlim = (0, .1), ylim = (0, .7), color = 'gray')
        sns.despine()
    plt.savefig('figures/cent_scatterplot.png')
    plt.savefig('figures/cent_scatterplot.p')

Download a higher resolution of the scatterplot graph in Seaborn (.pdf)

Visualizing networks

Of course, we can also visualize the networks themselves. Networkx does an adequate job of this, provided the networks are not too large.

    eig = nx.eigenvector_centrality(giant_coauth)
    size = [2000 * eig[node] for node in giant_coauth]

    nx.draw_spring(giant_coauth, node_size = size, with_labels = True, font_size = 5,
                   node_color = "#FFFFFF", edge_color = "#D4D5CE", alpha = .95)
    plt.savefig('figures/network_coauthors.png')
    plt.savefig('figures/network_coauthors.pdf')

Download a higher resolution of the co-authors network graph (.pdf)

Community detection

If you install the python-louvain package (with pip3 install python-louvain), you can perform Louvain community detection³ on your networks quite easily. Once it is installed on your machine, you load the package with import community.

    partition = community.best_partition(giant_coauth)
    modularity = community.modularity(partition, giant_coauth)
    print('Modularity:', modularity)

Modularity: 0.8385764300280952
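To get a feel for what the modularity score measures, it can help to compute it by hand on a tiny network. For each community, modularity compares the share of edges that fall inside the community with the share you would expect if edges were placed at random while keeping node degrees fixed. The sketch below is a plain-Python illustration for unweighted, undirected edge lists, not the python-louvain implementation. The function and variable names are my own, and the partition is a dictionary mapping nodes to community ids, the same shape that best_partition() returns.

```python
from collections import defaultdict

def modularity_score(edges, partition):
    """Newman modularity Q for an unweighted, undirected edge list.

    partition maps each node to a community id.
    """
    m = len(edges)
    internal_edges = defaultdict(int)  # edges with both endpoints in the community
    degree_sum = defaultdict(int)      # total degree of the community's nodes
    for u, v in edges:
        degree_sum[partition[u]] += 1
        degree_sum[partition[v]] += 1
        if partition[u] == partition[v]:
            internal_edges[partition[u]] += 1
    return sum(
        internal_edges[c] / m - (degree_sum[c] / (2 * m)) ** 2
        for c in degree_sum
    )

# Toy network: two triangles joined by a single bridge edge (c-d)
edges = [
    ('a', 'b'), ('b', 'c'), ('a', 'c'),
    ('d', 'e'), ('e', 'f'), ('d', 'f'),
    ('c', 'd'),
]
partition = {'a': 0, 'b': 0, 'c': 0, 'd': 1, 'e': 1, 'f': 1}

q = modularity_score(edges, partition)
print('Modularity:', round(q, 6))
```

Putting every node in a single community gives a score of zero, which is why values well above zero, like the one reported for the co-authorship network, suggest clearly separated groups.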

Once we have computed the modularity and partitioned the network, we can color the nodes based on their community membership. I’m using the “Set2” color palette.

    colors = [partition[n] for n in giant_coauth.nodes()]
    my_colors = plt.cm.Set2 # you can select other color palettes here: https://matplotlib.org/users/colormaps.html

    nx.draw(giant_coauth, node_color=colors, cmap = my_colors, edge_color = "#D4D5CE")
    plt.savefig('figures/coauthors_community.png')
    plt.savefig('figures/coauthors_community.pdf')

Download a higher resolution of the co-authors community network graph (.pdf)

Co-citation networks

So far we have been looking at an example of a co-authorship network. Let’s dig into a few other network generators, starting with journal-level co-citation. In the example below we will set coreOnly = True, which means that metaknowledge will only add nodes to the co-citation network if the document in question was part of the original RecordCollection object. If you set it to False (which is the default), then metaknowledge will produce a much larger co-citation network that includes every item that appears in every bibliography for all of the Records in your RecordCollection.

In addition, we are going to remove all edges that don’t meet a given threshold: 3. In a co-citation network, you can (and should) get rid of a lot of noise by simply ignoring co-citations that only happen once or twice.
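If co-citation is new to you, the logic is simple: two sources are co-cited once for every bibliography that lists both of them, and the edge weight is the number of times that happens. The sketch below illustrates the counting and the thresholding in plain Python with made-up data. It is not how metaknowledge builds the network internally, and the variable names are just for illustration.

```python
from collections import Counter
from itertools import combinations

# Toy bibliographies: each citing paper maps to the sources it cites (made-up labels)
bibliographies = {
    'paper1': ['A', 'B', 'C'],
    'paper2': ['A', 'B'],
    'paper3': ['A', 'B', 'D'],
    'paper4': ['C', 'D'],
}

# Two sources are co-cited once for every bibliography that lists both of them
cocitation_counts = Counter()
for cited in bibliographies.values():
    for pair in combinations(sorted(set(cited)), 2):
        cocitation_counts[pair] += 1

# Keep only pairs that are co-cited at least `min_weight` times
min_weight = 3
kept_edges = {pair: n for pair, n in cocitation_counts.items() if n >= min_weight}
print(kept_edges)
```

In this toy example five of the six pairs are co-cited only once, so a threshold of 3 leaves a single edge. Real co-citation networks behave similarly, which is why thresholding removes so much noise.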

    journal_cocite = RC1014.networkCoCitation(coreOnly = True)
    mk.dropEdges(journal_cocite, minWeight = 3)
    print(mk.graphStats(journal_cocite))

    Nodes: 1261
    Edges: 1119
    Isolates: 889
    Self loops: 21
    Density: 0.00140856
    Transitivity: 0.300846

Let’s just focus on the giant component.

    # visualize the giant component only
    # (nx.connected_component_subgraphs() was removed in networkx 2.4)
    giantJournal = journal_cocite.subgraph(max(nx.connected_components(journal_cocite), key=len)).copy()

    nx.draw_spring(giantJournal, with_labels = False, node_size = 75,
                   node_color = "#77787B", edge_color = "#D4D5CE", alpha = .95)
    plt.savefig('figures/network_journal_cocite.png')
    plt.savefig('figures/network_journal_cocite.pdf')

Download a higher resolution of the co-citation network graph (.pdf)

We can also run community detection algorithms on co-citation networks.

    partition = community.best_partition(giantJournal)
    modularity = community.modularity(partition, giantJournal)
    print('Modularity:', modularity)

    colors = [partition[n] for n in giantJournal.nodes()]
    nx.draw_spring(giantJournal, node_color=colors, with_labels = False, cmap=plt.cm.tab10,
                   node_size = 100, edge_color = "#D4D5CE")
    plt.savefig('figures/network_journal_cocite_community.png')
    plt.savefig('figures/network_journal_cocite_community.pdf')

Modularity: 0.4262191650801135

Network graph of co-citation networks community.

Download a higher resolution of the co-citation networks community network graph (.pdf)

Co-investigator networks

metaknowledge also simplifies the process of generating collaboration networks for co-investigators on grants. This time we are using .networkCoInvestigator(), which is a method of the GrantCollection object (not the RecordCollection object).

We are going to switch up the dataset for this network. We will work with the NSERC (Natural Sciences and Engineering Research Council of Canada) data in the ‘raw_data’ directory of the Online Supplement GitHub repository for the metaknowledge article.

    nserc_grants = mk.GrantCollection('raw_data/grants/nserc/')
    print('There are', len(nserc_grants), 'Grants in this Grant Collection.')

There are 71184 Grants in this Grant Collection.

    ci_nets = nserc_grants.networkCoInvestigator()
    print(mk.graphStats(ci_nets))

    Nodes: 33655
    Edges: 130586
    Isolates: 26284
    Self loops: 4
    Density: 0.00023059
    Transitivity: 0.902158

For this example, we can restrict this network to recurring collaborations within the giant component. This makes plotting easier because it makes the network a lot smaller. For that same reason, you would likely make a different decision if this were a real research project.

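    mk.dropEdges(ci_nets, minWeight = 4)
    # nx.connected_component_subgraphs() was removed in networkx 2.4,
    # so take the largest connected component and build the subgraph explicitly
    giant_ci = ci_nets.subgraph(max(nx.connected_components(ci_nets), key=len)).copy()
    print(mk.graphStats(giant_ci))

    Nodes: 250
    Edges: 680
    Isolates: 0
    Self loops: 0
    Density: 0.0218474
    Transitivity: 0.679722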

    partition_ci = community.best_partition(giant_ci)
    modularity_ci = community.modularity(partition_ci, giant_ci)
    print('Modularity:', modularity_ci)

    colors_ci = [partition_ci[n] for n in giant_ci.nodes()]
    nx.draw_spring(giant_ci, node_color=colors_ci, with_labels = False, cmap=plt.cm.tab10,
                   node_size = 100, edge_color = "#D4D5CE")
    plt.savefig('figures/network_coinvestigators.png')
    plt.savefig('figures/network_coinvestigators.pdf')

Modularity: 0.8521804432230576

Download a higher resolution of the network graph of co-investigator networks (.pdf)

As before, we can easily identify researchers with the highest centrality scores. Here are the 10 researchers with the highest betweenness centrality scores.

    bet = nx.betweenness_centrality(giant_ci)
    bet_df = pandas.DataFrame.from_dict([bet]).transpose()
    bet_df.columns = ['betweenness']
    bet_df.sort_values(by = ['betweenness'], ascending = False)[:10]

	betweenness
Mi, Zetian	0.555067
Farnood, Ramin	0.423123
Botton, Gianluigi	0.403399
Kortschot, Mark	0.345576
Sain, Mohini	0.341042
Kherani, Nazir	0.293202
Wilkinson, David	0.283616
Ruda, Harry	0.256005
VandeVen, Theodorus	0.238875
Hill, Reghan	0.238875

Or we could look at a bar graph of 100 researchers with the highest betweenness centrality scores.

    topbet_nserc = bet_df.sort_values('betweenness', ascending = False)[:100]

    trace = go.Bar(
        x = topbet_nserc.index,
        y = topbet_nserc['betweenness']
    )
    data = [trace]
    layout = go.Layout(
        yaxis=dict(
            title='Betweenness Centrality',
        )
    )
    fig = go.Figure(data=data, layout=layout)
    py.iplot(fig, filename='betweenness_nserc')

Institution-level co-investigator networks

All of this can be done at the institution level as well.

    inst = nserc_grants.networkCoInvestigatorInstitution()
    print(mk.graphStats(inst))

    Nodes: 5489
    Edges: 32552
    Isolates: 823
    Self loops: 165
    Density: 0.00216123
    Transitivity: 0.17326

    deg_inst = nx.degree_centrality(inst)
    deg_inst_df = pandas.DataFrame.from_dict([deg_inst]).transpose()
    deg_inst_df.columns = ['Degree Centrality']
    deg_inst_df.sort_values(by = ['Degree Centrality'], ascending = False)[:15]

	Degree Centrality
University of British Columbia	0.156159
University of Toronto	0.148688
University of Waterloo	0.140306
University of Alberta	0.129009
McGill University	0.120445
Université Laval	0.107507
École Polytechnique de Montréal	0.092748
University of Calgary	0.090561
University of Ottawa	0.085459
Queen's University	0.083273
McMaster University	0.074708
University of Guelph	0.069606
Carleton University	0.067238
University of Western Ontario	0.066509
Université de Sherbrooke	0.063411

    inst_cent = deg_inst_df.sort_values('Degree Centrality', ascending = False)[:100]

    trace = go.Bar(
        x = inst_cent.index,
        y = inst_cent['Degree Centrality']
    )
    data = [trace]
    layout = go.Layout(
        yaxis=dict(
            title='Degree Centrality',
        )
    )
    fig = go.Figure(data=data, layout=layout)
    py.iplot(fig, filename='degree_nserc_inst')

Bar graph of the 100 institutions with the highest degree centrality scores.

    eig_inst = nx.eigenvector_centrality(inst)
    eig_inst_df = pandas.DataFrame.from_dict([eig_inst]).transpose()
    eig_inst_df.columns = ['Eigenvector Centrality']
    eig_inst_df.sort_values(by = ['Eigenvector Centrality'], ascending = False)[:15]

	Eigenvector Centrality
University of British Columbia	0.156159
University of Toronto	0.148688
University of Waterloo	0.140306
University of Alberta	0.129009
McGill University	0.120445
Université Laval	0.107507
École Polytechnique de Montréal	0.092748
University of Calgary	0.090561
University of Ottawa	0.085459
Queen's University	0.083273
McMaster University	0.074708
University of Guelph	0.069606
Carleton University	0.067238
University of Western Ontario	0.066509
Université de Sherbrooke	0.063411

Wrapping up: Writing networks to disk

The Jupyter Notebooks on GitHub have more examples, including of keyword co-occurrence networks, two-mode networks, and multi-mode networks. I’ll work some of that material into a blog post sometime, but if you are interested in seeing it now, all of the code is already available over there.

One of the central design goals with metaknowledge was to make integrations with other software seamless. I was especially obsessed with making it as easy as possible to go between metaknowledge and the statnet suite of R libraries, which have no equals in the Python world. As such, it’s really easy to write your networks to disk and open them up using other software. The main function in metaknowledge for writing to disk is writeGraph(), which produces a weighted edge list and a node attribute file, both in csv format.

mk.writeGraph(inst , 'generated_datasets/institutional_collaboration_network/')

Of course, because networkx functions work on metaknowledge networks, you can use any of the networkx writers as well. For example, graphml.

nx.write_graphml(inst, 'generated_datasets/institutional_collaboration_network/inst_network.graphml')
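If you want to convince yourself that nothing is lost along the way, you can round-trip a small graph through GraphML and read it back in. The sketch below uses only networkx and a temporary directory. The toy graph and file name are made up for illustration and are not part of the NSERC data.

```python
import os
import tempfile

import networkx as nx

# Toy institution-level network with weighted edges (made-up data)
toy = nx.Graph()
toy.add_edge('University A', 'University B', weight=3)
toy.add_edge('University B', 'University C', weight=1)

# Write the graph to a temporary GraphML file, then read it back in
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_network.graphml')
    nx.write_graphml(toy, path)
    restored = nx.read_graphml(path)

print(sorted(restored.nodes()))
print(restored['University A']['University B']['weight'])
```

The same GraphML file can then be opened in other network analysis tools, with the edge weights carried along as an edge attribute.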

¹ Reid and Jillian are both recent graduates of NetLab at the University of Waterloo.
² With a bit of extra code you can also make plot.ly display the node names when you hover over the points in the scatterplot. To keep things relatively simple, I won’t get into that here.
³ See Blondel et al. (2008), “Fast unfolding of communities in large networks”. An arXiv pre-print of the paper is also available (.pdf).
