Network analysis with metaknowledge

Tuesday, August 29, 2017
by John McLevey

This post on generating and analyzing scientific networks with metaknowledge builds on a previous post about the RecordCollection object. Like last time, this post is a slightly cleaned up version of the online supplements for the article I wrote on metaknowledge with Reid McIlroy-Young. You can read “Introducing metaknowledge: Software for Computational Research in Information Science, Science of Science, and Network Analysis”. Open access version coming soon.

If you would prefer to work through this material in a Jupyter Notebook, you should download the original online supplement directory. It includes all the data from the article plus four notebooks that follow along with the article. The notebooks were prepared by me, Reid McIlroy-Young (my former student, now a graduate student at University of Chicago), and Jillian Anderson (my former student, now an MSc student at Simon Fraser University).1

We are going to produce quite a few visualizations of networks in this post. While I love network visualizations, you generally shouldn’t rely on them for any serious bibliometric or social science research purposes, especially when you have large networks like the ones we are dealing with here. I’m producing lots of visualizations to show you how you can get started working with the network objects that metaknowledge produces. What you do with those network objects in your own research is, obviously, entirely up to you.

The second reason I am producing lots of network visualizations in this post is to demonstrate that it is possible to produce slightly-better-than-hideous visualizations of networks in Python, even for large networks like these. But if you really depend on visualizations for one reason or another, you should probably write your networks to disk and read them into software like Gephi or Visone.

Let’s get started. We will use the same information science dataset used in the last metaknowledge post.

import metaknowledge as mk
import matplotlib.pyplot as plt
import seaborn as sns
import networkx as nx
import community # install python-louvain for community detection
import pandas

# for interactive graphs
import plotly.plotly as py
import plotly.graph_objs as go

plt.rc("savefig", dpi=600)
sns.set(font_scale=.75)

Read data, create a RecordCollection, and generate a network

Like last time, we will begin by producing a RecordCollection object.

RC = mk.RecordCollection('raw_data/imetrics/', cached = True)
RC1014 = RC.yearSplit(2010,2014)
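
It is worth a quick sanity check on the split before moving on. A RecordCollection supports len(), so a minimal sketch:

# sanity check: record counts before and after the year split
print(len(RC), 'records in total')
print(len(RC1014), 'records in the 2010-2014 window')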

Now we can easily use any of the network generator methods to get a network. Let’s use the .networkCoAuthor() method to get a co-authorship network, and then use the .graphStats() method to get an idea of the size of the network.

coauth_net = RC.networkCoAuthor()
print(mk.graphStats(coauth_net))

Nodes: 10104
Edges: 15507
Isolates: 1111
Self loops: 0
Density: 0.000303818
Transitivity: 0.555409

Once the network has been generated, you typically want to process it a bit. With metaknowledge, you can modify the network that is already loaded in your computer’s memory rather than creating a copy. In this case, we will drop any edges with a weight less than 2, then drop self-loops, and finally extract the giant component.

mk.dropEdges(coauth_net, minWeight = 2, dropSelfLoops = True)
giant_coauth = max(nx.connected_component_subgraphs(coauth_net), key=len)
print(mk.graphStats(giant_coauth))

Nodes: 265
Edges: 443
Isolates: 0
Self loops: 0
Density: 0.0126644
Transitivity: 0.285714

We are left with a much smaller network! If this were a research project, you would probably want to make different decisions here.
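
Before settling on a cutoff, it can help to see how sensitive the network is to the minWeight threshold. Here is a quick sketch that regenerates the network on each pass, since dropEdges modifies the graph in place:

# sketch: how does the minWeight cutoff change the network's size?
for threshold in [1, 2, 3, 4]:
    net = RC.networkCoAuthor() # fresh network each pass
    mk.dropEdges(net, minWeight = threshold, dropSelfLoops = True)
    print('minWeight =', threshold)
    print(mk.graphStats(net))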

We can use any networkx methods to analyze our network, such as computing centralities and global network properties. For example, let’s compute four common centrality measures (degree, eigenvector, betweenness, and closeness) and store them in a Pandas dataframe. Once we have the dataframe, we can sort the rows using the .sort_values() method from Pandas. Instead of printing the entire dataframe (there are still 265 rows), I’ll just show the top 10.

deg = nx.degree_centrality(giant_coauth)
eig = nx.eigenvector_centrality(giant_coauth)
bet = nx.betweenness_centrality(giant_coauth)
clo = nx.closeness_centrality(giant_coauth)

cent_df = pandas.DataFrame.from_dict([deg, eig, bet, clo])
cent_df = pandas.DataFrame.transpose(cent_df)
cent_df.columns = ['degree', 'eigenvector', 'betweenness', 'closeness']
cent_df.sort_values('degree', ascending = False)[:10]

  degree eigenvector betweenness closeness
Thelwall, Mike 0.098485 6.875438e-04 0.217349 0.241758
Rousseau, Ronald 0.079545 2.021960e-03 0.191199 0.233216
Leydesdorff, Loet 0.079545 3.707337e-01 0.755110 0.314660
Ding, Ying 0.071970 1.523765e-03 0.168347 0.238914
Lariviere, Vincent 0.068182 1.008884e-03 0.108039 0.239782
Glanzel, Wolfgang 0.056818 4.564843e-03 0.158634 0.232190
Sugimoto, Cassidy R. 0.053030 1.258804e-03 0.077937 0.245125
Bornmann, Lutz 0.045455 6.510081e-01 0.041503 0.251908
de Moya-Anegon, Felix 0.041667 1.356966e-02 0.103560 0.249292
Wang, Xianwen 0.041667 6.978493e-07 0.074000 0.149153
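
Since all four measures are columns in one dataframe, a quick correlation matrix shows how much they agree with one another. This is plain pandas, nothing metaknowledge-specific:

# how strongly do the four centrality measures correlate?
print(cent_df.corr())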

We can produce bar graphs of our centrality scores to facilitate comparisons. For example, let’s look at the top 100 degree scores. We will use plot.ly again, so that our graphs are interactive. Interactivity makes it easier to read the names on the x-axis. (Although we could just produce a horizontal bar graph instead; see the sketch below.)

cent_df_d100 = cent_df.sort_values('degree', ascending = False)[:100]

trace = go.Bar(
    x = cent_df_d100.index,
    y = cent_df_d100['degree']
)

data = [trace]

layout = go.Layout(
    yaxis=dict(
        title='Degree Centrality',
    )
)

fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='cent-dist')

Click on the graph to see it on plot.ly.
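
For the horizontal variant mentioned above, plotly just needs the axes swapped and orientation = 'h'. A minimal sketch (the filename is arbitrary):

# horizontal bars keep long author names legible
trace = go.Bar(
    x = cent_df_d100['degree'],
    y = cent_df_d100.index,
    orientation = 'h'
)

layout = go.Layout(
    xaxis=dict(
        title='Degree Centrality',
    )
)

fig = go.Figure(data=[trace], layout=layout)
py.iplot(fig, filename='cent-dist-h')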

Next, let’s look at the top 100 betweenness centrality scores.

cent_df_b100 = cent_df.sort_values('betweenness', ascending = False)[:100]

trace = go.Bar(
    x = cent_df_b100.index,
    y = cent_df_b100['betweenness']
)

data = [trace]

layout = go.Layout(
    yaxis=dict(
        title='Betweenness Centrality',
    )
)

fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='cent-dist-b')

Click on the graph to see it on plot.ly.

And the top 100 eigenvector centrality scores.

cent_df_e100 = cent_df.sort_values('eigenvector', ascending = False)[:100]

trace = go.Bar(
    x = cent_df_e100.index,
    y = cent_df_e100['eigenvector']
)

data = [trace]
layout = go.Layout(
    yaxis=dict(
        title='Eigenvector Centrality',
    )
)

fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='cent-dist-e')

Click on the graph to see it on plot.ly.

Of course we are not limited to bar charts for exploring centrality. We can also produce simple scatterplots.2

trace = go.Scatter(
    x = cent_df['degree'],
    y = cent_df['betweenness'],
    mode = 'markers'
)

data = [trace]

layout = go.Layout(
    xaxis=dict(
        title='Degree Centrality',
    ),
    yaxis=dict(
        title='Betweenness Centrality',
    )
)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='centralities-scatter')

Click on the graph to see it on plot.ly.
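
As footnote 2 mentions, getting plot.ly to show the author names on hover takes only a little extra code: pass the dataframe index to the text argument. A minimal sketch, reusing the layout from above (the filename is arbitrary):

# sketch: show author names on hover via the text argument
trace = go.Scatter(
    x = cent_df['degree'],
    y = cent_df['betweenness'],
    mode = 'markers',
    text = cent_df.index
)

fig = go.Figure(data=[trace], layout=layout)
py.iplot(fig, filename='centralities-scatter-hover')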

Sometimes you need a static plot for a journal article. Within the Python ecosystem, I prefer Seaborn. Producing a scatterplot is straightforward.

with sns.axes_style('white'):
    sns.jointplot(x='degree', y='eigenvector', data=cent_df, xlim = (0, .1), ylim = (0, .7), color = 'gray')
    sns.despine()
plt.savefig('figures/cent_scatterplot.png')
plt.savefig('figures/cent_scatterplot.pdf')

Scatterplot graph using Seaborn.

Download a higher resolution of the scatterplot graph in Seaborn (.pdf)

Visualizing networks

Of course, we can also visualize the networks themselves. Networkx does an adequate job of this, provided the networks are not too large.

eig = nx.eigenvector_centrality(giant_coauth)
size = [2000 * eig[node] for node in giant_coauth]

nx.draw_spring(giant_coauth, node_size = size, with_labels = True, font_size = 5,
               node_color = "#FFFFFF", edge_color = "#D4D5CE", alpha = .95)
plt.savefig('figures/network_coauthors.png')
plt.savefig('figures/network_coauthors.pdf')

Network graph of co-authors.

Download a higher resolution of the co-authors network graph (.pdf)
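
With 265 nodes, labelling everything gets crowded fast. One workaround is to label only the most central authors. A sketch, assuming we keep the top ten by eigenvector centrality (the output filename is arbitrary):

# sketch: draw the network but label only the ten most central authors
top10 = sorted(eig, key=eig.get, reverse=True)[:10]
pos = nx.spring_layout(giant_coauth)
nx.draw(giant_coauth, pos, node_size = size, node_color = "#FFFFFF",
        edge_color = "#D4D5CE", alpha = .95)
nx.draw_networkx_labels(giant_coauth, pos, labels = {n: n for n in top10}, font_size = 6)
plt.savefig('figures/network_coauthors_top10.png')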

Community detection

If you install the python-louvain package (with pip3 install python-louvain), you can perform Louvain community detection3 on your networks quite easily. Once it is installed on your machine, you load the package with import community.

partition = community.best_partition(giant_coauth)
modularity = community.modularity(partition, giant_coauth)
print('Modularity:', modularity)

Modularity: 0.8385764300280952
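
The partition that best_partition() returns is just a dict mapping each author to a community number, so it is easy to inspect the community sizes. A quick sketch using only the standard library:

# sketch: count the number of authors in each detected community
from collections import Counter

for comm, n_authors in Counter(partition.values()).most_common():
    print('Community', comm, ':', n_authors, 'authors')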

Once we have computed the modularity and partitioned the network, we can color the nodes based on their community membership. I’m using the “Set2” color palette.

colors = [partition[n] for n in giant_coauth.nodes()]
my_colors = plt.cm.Set2 # you can select other color palettes here: https://matplotlib.org/users/colormaps.html
nx.draw(giant_coauth, node_color=colors, cmap = my_colors, edge_color = "#D4D5CE")
plt.savefig('figures/coauthors_community.png')
plt.savefig('figures/coauthors_community.pdf')

Network graph of co-authors community.

Download a higher resolution of the co-authors community network graph (.pdf)

Co-citation networks

So far we have been looking at an example of a co-authorship network. Let’s dig into a few other network generators, starting with journal-level co-citation. In the example below we will set coreOnly = True, which means that metaknowledge will only add nodes to the co-citation network if the document in question was part of the original RecordCollection object. If you set it to False (which is the default), then metaknowledge will produce a much larger co-citation network that includes every item that appears in every bibliography for all of the Records in your RecordCollection.
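
If you are curious about how much coreOnly = True shrinks the network, you can generate both versions and compare their stats. A sketch (output omitted; the full version can take a while to build):

# sketch: compare the core-only and full co-citation networks
full_cocite = RC1014.networkCoCitation() # coreOnly = False is the default
core_cocite = RC1014.networkCoCitation(coreOnly = True)
print(mk.graphStats(full_cocite))
print(mk.graphStats(core_cocite))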

In addition, we are going to remove all edges with a weight below 3. In a co-citation network, you can (and should) get rid of a lot of noise by simply ignoring co-citations that only happen once or twice.

journal_cocite = RC1014.networkCoCitation(coreOnly = True)
mk.dropEdges(journal_cocite, minWeight = 3)
print(mk.graphStats(journal_cocite))

Nodes: 1261
Edges: 1119
Isolates: 889
Self loops: 21
Density: 0.00140856
Transitivity: 0.300846

Let’s just focus on the giant component.

# visualize the giant component only
giantJournal = max(nx.connected_component_subgraphs(journal_cocite), key=len)

nx.draw_spring(giantJournal, with_labels = False, node_size = 75,
              node_color = "#77787B", edge_color = "#D4D5CE", alpha = .95)
plt.savefig('figures/network_journal_cocite.png')
plt.savefig('figures/network_journal_cocite.pdf')

Network graph of co-citation networks.

Download a higher resolution of the co-citation network graph (.pdf)

We can also run community detection algorithms on co-citation networks.

partition = community.best_partition(giantJournal)
modularity = community.modularity(partition, giantJournal)
print('Modularity:', modularity)

colors = [partition[n] for n in giantJournal.nodes()]
nx.draw_spring(giantJournal, node_color=colors, with_labels = False, cmap=plt.cm.tab10, node_size = 100, edge_color = "#D4D5CE")
plt.savefig('figures/network_journal_cocite_community.png')
plt.savefig('figures/network_journal_cocite_community.pdf')

Modularity: 0.4262191650801135

Network graph of co-citation networks community.

Download a higher resolution of the co-citation networks community network graph (.pdf)

Co-investigator networks

metaknowledge also simplifies the process of generating collaboration networks for co-investigators on grants. This time we are using the .networkCoInvestigator() method, which belongs to the GrantCollection object (not the RecordCollection object).

We are going to switch up the dataset for this network. We will work with the NSERC (Natural Sciences and Engineering Research Council of Canada) data in the ‘raw_data’ directory of the Online Supplement GitHub repository for the metaknowledge article.

nserc_grants = mk.GrantCollection('raw_data/grants/nserc/')
print('There are', len(nserc_grants), 'Grants in this Grant Collection.')

There are 71184 Grants in this Grant Collection.

ci_nets = nserc_grants.networkCoInvestigator()
print(mk.graphStats(ci_nets))

Nodes: 33655
Edges: 130586
Isolates: 26284
Self loops: 4
Density: 0.00023059
Transitivity: 0.902158

For this example, we can restrict this network to recurring collaborations within the giant component. This makes the network a lot smaller, which makes plotting easier. For that same reason, you would likely make a different decision if this were a real research project.

mk.dropEdges(ci_nets, minWeight = 4)
giant_ci = max(nx.connected_component_subgraphs(ci_nets), key=len)
print(mk.graphStats(giant_ci))

Nodes: 250
Edges: 680
Isolates: 0
Self loops: 0
Density: 0.0218474
Transitivity: 0.679722

partition_ci = community.best_partition(giant_ci)
modularity_ci = community.modularity(partition_ci, giant_ci)
print('Modularity:', modularity_ci)

colors_ci = [partition_ci[n] for n in giant_ci.nodes()]
nx.draw_spring(giant_ci, node_color=colors_ci, with_labels = False, cmap=plt.cm.tab10, node_size = 100, edge_color = "#D4D5CE")
plt.savefig('figures/network_coinvestigators.png')
plt.savefig('figures/network_coinvestigators.pdf')

Modularity: 0.8521804432230576

Network graph of co-investigator networks.

Download a higher resolution of the network graph of co-investigator networks (.pdf)

As before, we can easily identify researchers with the highest centrality scores. Here are the ten researchers with the highest betweenness centrality scores.

bet = nx.betweenness_centrality(giant_ci)
bet_df = pandas.DataFrame.from_dict([bet]).transpose()
bet_df.columns = ['betweenness']
bet_df.sort_values(by = ['betweenness'], ascending = False)[:10]

  betweenness
Mi, Zetian 0.555067
Farnood, Ramin 0.423123
Botton, Gianluigi 0.403399
Kortschot, Mark 0.345576
Sain, Mohini 0.341042
Kherani, Nazir 0.293202
Wilkinson, David 0.283616
Ruda, Harry 0.256005
VandeVen, Theodorus 0.238875
Hill, Reghan 0.238875

Or we could look at a bar graph of the 100 researchers with the highest betweenness centrality scores.

topbet_nserc = bet_df.sort_values('betweenness', ascending = False)[:100]

trace = go.Bar(
    x = topbet_nserc.index,
    y = topbet_nserc['betweenness']
)

data = [trace]

layout = go.Layout(
    yaxis=dict(
        title='Betweenness Centrality',
    )
)

fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='betweenness_nserc')

Click on the graph to see it on plot.ly.

Institution-level co-investigator networks

All of this can be done at the institution level as well.

inst = nserc_grants.networkCoInvestigatorInstitution()
print(mk.graphStats(inst))

Nodes: 5489
Edges: 32552
Isolates: 823
Self loops: 165
Density: 0.00216123
Transitivity: 0.17326

deg_inst = nx.degree_centrality(inst)
deg_inst_df = pandas.DataFrame.from_dict([deg_inst]).transpose()
deg_inst_df.columns = ['Degree Centrality']
deg_inst_df.sort_values(by = ['Degree Centrality'], ascending = False)[:15]

  Degree Centrality
University of British Columbia 0.156159
University of Toronto 0.148688
University of Waterloo 0.140306
University of Alberta 0.129009
McGill University 0.120445
Université Laval 0.107507
École Polytechnique de Montréal 0.092748
University of Calgary 0.090561
University of Ottawa 0.085459
Queen's University 0.083273
McMaster University 0.074708
University of Guelph 0.069606
Carleton University 0.067238
University of Western Ontario 0.066509
Université de Sherbrooke 0.063411

inst_cent = deg_inst_df.sort_values('Degree Centrality', ascending = False)[:100]

trace = go.Bar(
    x = inst_cent.index,
    y = inst_cent['Degree Centrality']
)

data = [trace]

layout = go.Layout(
    yaxis=dict(
        title='Degree Centrality',
    )
)

fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='degree_nserc_inst')

Click on the graph to see it on plot.ly.

eig_inst = nx.eigenvector_centrality(inst)
eig_inst_df = pandas.DataFrame.from_dict([eig_inst]).transpose()
eig_inst_df.columns = ['Eigenvector Centrality']
eig_inst_df.sort_values(by = ['Eigenvector Centrality'], ascending = False)[:15]

  Eigenvector Centrality
University of British Columbia 0.156159
University of Toronto 0.148688
University of Waterloo 0.140306
University of Alberta 0.129009
McGill University 0.120445
Université Laval 0.107507
École Polytechnique de Montréal 0.092748
University of Calgary 0.090561
University of Ottawa 0.085459
Queen's University 0.083273
McMaster University 0.074708
University of Guelph 0.069606
Carleton University 0.067238
University of Western Ontario 0.066509
Université de Sherbrooke 0.063411

Wrapping up: Writing networks to disk

The Jupyter Notebooks on GitHub have more examples, including keyword co-occurrence networks, two-mode networks, and multi-mode networks. I’ll work some of that material into a blog post sometime, but if you are interested in seeing it now, all of the code is already available over there.

One of the central design goals with metaknowledge was to make integration with other software seamless. I was especially obsessed with making it as easy as possible to go between metaknowledge and the statnet suite of R libraries, which have no equals in the Python world. As such, it’s really easy to write your networks to disk and open them up using other software. The main function in metaknowledge for writing to disk is writeGraph(), which produces a weighted edge list and a node attribute file, both in CSV format.

mk.writeGraph(inst, 'generated_datasets/institutional_collaboration_network/')
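
metaknowledge also has a readGraph() function for the reverse trip. A sketch follows, but treat it as an assumption: the file names are guesses on my part, so check what writeGraph() actually put in the directory and consult the metaknowledge docs for the exact signature.

# sketch: rebuild the graph from the CSVs that writeGraph() produced
# NOTE: the file names here are guesses; check the output directory
inst_reloaded = mk.readGraph(
    'generated_datasets/institutional_collaboration_network/inst_edgeList.csv',
    'generated_datasets/institutional_collaboration_network/inst_nodeAttributes.csv'
)
print(mk.graphStats(inst_reloaded))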

Of course, because networkx functions work on metaknowledge networks, you can use any of the networkx writers as well. For example, GraphML:

nx.write_graphml(inst, 'generated_datasets/institutional_collaboration_network/inst_network.graphml')
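
And if Gephi is your destination, GEXF is its native format; networkx has a writer for that as well (assuming the node attributes are types GEXF can serialize):

# GEXF is Gephi's native exchange format
nx.write_gexf(inst, 'generated_datasets/institutional_collaboration_network/inst_network.gexf')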


  1. Reid and Jillian are both recent graduates of NetLab at the University of Waterloo.
  2. With a bit of extra code you can also make plot.ly display the node names when you hover over the points in the scatterplot; a minimal sketch appears after the scatterplot code.
  3. See the Blondel et al. (2008) paper “Fast unfolding of communities in large networks”. See the arXiv pre-print of the paper (.pdf).