Throughout the 2016 US presidential campaign, Hillary Clinton’s private email server was a hot topic. A University of Waterloo professor and senior undergraduate student have now built a tool to analyze the contents of the server.
The project began as part of a fourth-year reading course at the University of Waterloo. Statistics major Christopher Salahub and statistics professor Wayne Oldford developed an interactive data visualization tool for anyone interested in exploring the timeline and contents of Clinton's publicly available emails. The tool is accessible online at rshiny.math.uwaterloo.ca/clinton/.
Users can explore the metadata of these emails, as well as some summaries of their contents, within any chosen time interval using an interactive timeline. Graphical representations of Clinton's most frequent correspondents show her "inner circle" and whether each correspondent was using a government email address. The daily volume and time stamps of email sent and received by Clinton, as well as the redaction codes used by the US State Department, are also displayed, as are the words appearing most frequently in emails from the selected time period. Users may also filter the displays by redaction code and by whether the mail was sent or received by Clinton.
“The tool provides visual analytic tools of this email corpus and demonstrates just how much can be discovered about an individual from what is mistakenly regarded as uninformative metadata,” said Oldford. “Moreover, the public can reproduce our analyses and see for themselves how such can be revealing, especially when combined with other publicly available sources.”
For example, in their analysis of the data, the researchers found 10 periods of no emails from Clinton while she was Secretary of State, including a sizeable gap between October 30 and November 9, 2012, which coincides with the initial investigation of the Benghazi attack and its aftermath, as well as the 2012 US presidential election.
The tool can only reveal these patterns of no email; it cannot explain them. Salahub and Oldford searched Google for news stories around these dates, as well as around 10 dates chosen at random for comparison. Comparing these news stories often suggested some explanation for the absence of email. The researchers are careful to warn users about confirmation bias and advise that findings from their tool can only supplement more comprehensive analysis.
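As an illustration of how such periods of no email might be located from time stamps alone, here is a minimal sketch in Python. It is not the researchers' method (they worked in R, and the article does not say how a "period of no emails" was defined). The function name `find_gaps`, the three-day threshold, and the sample time stamps are all assumptions invented for this example.

```python
from datetime import datetime, timedelta


def find_gaps(timestamps, min_gap=timedelta(days=3)):
    """Return (earlier, later) pairs of consecutive emails separated by at least min_gap.

    The three-day default is an arbitrary, hypothetical threshold.
    """
    ordered = sorted(timestamps)
    gaps = []
    # Compare each email with the next one in chronological order.
    for earlier, later in zip(ordered, ordered[1:]):
        if later - earlier >= min_gap:
            gaps.append((earlier, later))
    return gaps


# Made-up time stamps for illustration only; these are not taken from the email corpus.
sample = [
    datetime(2000, 1, 1, 9, 0),
    datetime(2000, 1, 2, 14, 30),
    datetime(2000, 1, 12, 8, 0),
    datetime(2000, 1, 13, 10, 0),
]

gaps = find_gaps(sample)
for start, end in gaps:
    print(f"No email between {start:%Y-%m-%d} and {end:%Y-%m-%d} ({(end - start).days} days)")
```

A sketch like this only flags the silences. As noted above, interpreting them requires outside information such as news coverage from the same dates.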
The researchers scraped Clinton’s email data from Wikileaks, which produced HTML source files for the 32,795 available emails to and from Clinton. Using the open source statistical programming environment R, they collected and processed the data from the emails and attachments. After processing the senders, receivers, time stamps, and which of the nine Freedom of Information Act exemption codes were used for redacted content, they parsed the data further. This included collecting words into stemmed word groups (different forms of the same word, such as “stop” and “stopped”) and excluding common and uninteresting words such as “and”, “is”, and “the”.
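The word-grouping step described above can be sketched roughly as follows. This is an illustrative Python example and not the researchers' R code. The stop-word list is a small hypothetical sample, and `crude_stem` is a deliberately simplified suffix stripper written for this sketch; it is not a standard stemming algorithm and will mishandle many English words.

```python
import re
from collections import Counter

# Hypothetical, abbreviated stop-word list; a real analysis would use a much longer one.
STOP_WORDS = {"and", "is", "the", "a", "of", "to", "in"}


def crude_stem(word):
    """Strip a few common suffixes so that, e.g., "stopped" and "stop" share a group."""
    for suffix in ("ing", "ed"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Undo consonant doubling: "stopp" -> "stop".
            if stem[-1] == stem[-2] and stem[-1] not in "aeiou":
                stem = stem[:-1]
            return stem
    # Drop a plural "s", but leave words such as "pass" alone.
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word


def word_group_counts(text):
    """Count stemmed word groups in text, ignoring stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(crude_stem(t) for t in tokens if t not in STOP_WORDS)


counts = word_group_counts("The meeting stopped and the talks stop.")
print(counts.most_common(1))  # [('stop', 2)]
```

Grouping by stem before counting means the most frequent "words" in a chosen time period reflect topics, not grammatical variants or filler words.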
Salahub and Oldford recently published an article about the tool and their findings, titled About “her emails”, in Significance Magazine.