Virginia first state to use technology-assisted review to classify publicly released Kaine Administration emails

Monday, October 22, 2018

Technology-assisted review (TAR) — an automated process used to select and prioritize documents for review, pioneered by Research Professor Maura Grossman and Professor Gordon Cormack — was used for the first time by a state archive to classify emails from the administration of former Virginia Governor Tim Kaine for release to the public.

Photo: Research Professor Maura Grossman and Professor Gordon Cormack, developers of the technology-assisted review process known as Continuous Active Learning™ (CAL™).

On October 17, 2018, the Library of Virginia released 26,988 new emails from the Kaine administration. Tim Kaine served as the Commonwealth of Virginia’s governor from 2006 to 2010. Since January 2014, the Library of Virginia has made more than 183,000 emails from Kaine’s administration available online to the public, according to the October 18, 2018 entry in Out of the Box, the Library of Virginia’s official blog.

The release of the nearly 27,000 new emails marks the first time TAR has been used to review email for public release. The particular technology used by the Library, Continuous Active Learning™ (CAL™), was developed by Grossman and Cormack. It can find substantially all relevant information on a particular subject within gigabytes of electronically stored information, in collections ranging from mass email records like those at issue here to tens of millions of legal documents.

CAL™ first presents the user with the documents most likely to be of interest, followed by those that are somewhat less likely to be of interest or relevance. Unlike a typical web-search engine, which focuses on identifying a few highly relevant documents, CAL™ uses machine learning to produce a high-recall result, that is, to identify substantially all relevant documents. It does so by refining its understanding of which of the remaining documents are most likely to be of interest, based on the user’s feedback on documents already retrieved, and it continues to retrieve documents until no more relevant documents can be found.
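The review loop described above can be illustrated with a small Python sketch. This is not Grossman and Cormack's implementation: the function names (`cal_review`, `score`), the additive term-weight scorer standing in for a real machine-learned classifier, the seed query, the batch size, the "stop after one batch with no relevant documents" rule, and the toy documents are all assumptions made for illustration.

```python
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (a deliberate simplification)."""
    return text.lower().split()


def score(doc_terms, rel_counts, nonrel_counts):
    """Toy relevance score: each term counts +1 for every relevant document it
    appeared in and -1 for every non-relevant one (stand-in for a classifier)."""
    return sum(rel_counts[t] - nonrel_counts[t] for t in doc_terms)


def cal_review(docs, seed_query, judge, batch_size=2, patience=1):
    """Rank, review, learn from feedback, and repeat.

    judge(i) is the reviewer's relevance decision for document i.
    Stops after `patience` consecutive batches with no relevant documents
    (an assumed stopping rule), or when every document has been reviewed.
    """
    doc_terms = [set(tokenize(d)) for d in docs]
    rel_counts, nonrel_counts = Counter(), Counter()
    rel_counts.update(tokenize(seed_query))  # seed acts as one relevant example
    unreviewed = set(range(len(docs)))
    found, order = [], []
    empty_batches = 0
    while unreviewed and empty_batches < patience:
        # Re-rank the remaining documents with everything learned so far.
        ranked = sorted(
            unreviewed,
            key=lambda i: (-score(doc_terms[i], rel_counts, nonrel_counts), i),
        )
        hits = 0
        for i in ranked[:batch_size]:
            unreviewed.discard(i)
            order.append(i)
            if judge(i):  # reviewer feedback updates the model
                rel_counts.update(doc_terms[i])
                found.append(i)
                hits += 1
            else:
                nonrel_counts.update(doc_terms[i])
        empty_batches = 0 if hits else empty_batches + 1
    return found, order


# Hypothetical toy collection; documents 0, 2 and 4 are the relevant ones.
DOCS = [
    "budget meeting agenda for transportation funding",
    "lunch order for friday",
    "transportation funding bill draft",
    "birthday party invitation",
    "funding request highway transportation project",
    "parking lot closed friday",
    "holiday schedule reminder",
    "office supplies order",
]
RELEVANT = {0, 2, 4}

found, order = cal_review(DOCS, "transportation funding", lambda i: i in RELEVANT)
print(found)  # [0, 2, 4]
print(order)  # [0, 2, 4, 1, 3, 6]
```

In this toy run all three relevant documents are found after six of the eight documents have been reviewed, and the two remaining documents are never shown to the reviewer. A production system would use a stronger classifier and a more careful stopping criterion than the one assumed here.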

Cormack and Grossman have collaborated with the Library of Virginia since 2015, when they used the initial release of email from Governor Kaine’s administration as a benchmark for evaluation at the Total Recall Track of the Text REtrieval Conference (TREC), organized by the National Institute of Standards and Technology. 

In addition to being a research professor in the David R. Cheriton School of Computer Science, Grossman is principal of Maura Grossman Law, an eDiscovery law and consulting firm based in Buffalo, New York. Before joining the University of Waterloo, she was of counsel at the prominent Manhattan law firm Wachtell, Lipton, Rosen & Katz. Cormack is a professor at the David R. Cheriton School of Computer Science. His research involves high-stakes information retrieval, in which the reliability and thoroughness of retrieval methods are of primary importance.