PhD Defence • Increasing the Efficiency of High-Recall Information Retrieval

Wednesday, April 10, 2019 9:00 am - 9:00 am EDT (GMT -04:00)

Haotian Zhang, PhD candidate
David R. Cheriton School of Computer Science

This paper tackles the challenge of accurately and efficiently estimating the number of relevant documents in a collection for a particular topic. One real-world application is estimating the volume of social media posts (e.g., tweets) pertaining to a topic, which is fundamental to tracking the popularity of politicians and brands, the potential sales of a product, etc. Our insight is to leverage active learning techniques to find all the “easy” documents, and then to use sampling techniques to infer the number of relevant documents in the residual collection. 
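The two-stage idea can be illustrated with a minimal sketch. It assumes simple uniform random sampling of the residual collection, which may differ from the sampling strategies evaluated in the work itself. The function name `estimate_total_relevant` and the `is_relevant` callback (standing in for a human assessor) are hypothetical names introduced for this illustration.

```python
import random


def estimate_total_relevant(found_relevant, residual_ids, is_relevant,
                            sample_size, seed=0):
    """Estimate the total number of relevant documents.

    found_relevant: count of relevant documents already found (e.g. by active learning).
    residual_ids:   list of document ids not yet reviewed.
    is_relevant:    hypothetical callback that labels one document (the assessor).
    sample_size:    number of residual documents to label.
    """
    rng = random.Random(seed)
    n = min(sample_size, len(residual_ids))
    if n == 0:
        # Nothing left to sample, so the count found so far is the estimate.
        return float(found_relevant)
    # Assumption: uniform sampling without replacement from the residual.
    sample = rng.sample(residual_ids, n)
    hits = sum(1 for doc_id in sample if is_relevant(doc_id))
    # Scale the sample's relevant fraction up to the whole residual.
    return found_relevant + len(residual_ids) * hits / n
```

The estimate is the number of relevant documents already found plus the sampled relevance rate multiplied by the residual size. The fewer relevant documents remain in the residual, the more labels are needed for a stable rate, which is why the choice of when to stop active learning matters.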

We propose a simple yet effective technique for determining the “switchover” point at which to move from active learning to sampling, which intuitively can be understood as the “knee” in an effort vs. recall gain curve, as well as alternative sampling strategies beyond the knee. We show on several TREC datasets and a collection of tweets that our best technique yields more accurate estimates (with the same effort) than several alternatives.
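To make the notion of a “knee” concrete, the sketch below uses one common geometric heuristic: pick the point on the gain curve that lies farthest from the straight line joining the curve's first and last points. This is an assumption made for illustration, not necessarily the technique proposed in the work. The function name `find_knee` and the input `cumulative_relevant` (relevant documents found after each unit of review effort) are hypothetical.

```python
def find_knee(cumulative_relevant):
    """Return the index of the point farthest from the chord of the gain curve.

    cumulative_relevant[i] is the number of relevant documents found
    after i units of review effort (assumed non-decreasing).
    """
    n = len(cumulative_relevant)
    if n == 0:
        raise ValueError("gain curve is empty")
    if n < 3:
        # Too few points to have an interior knee; use the last point.
        return n - 1
    x0, y0 = 0, cumulative_relevant[0]
    x1, y1 = n - 1, cumulative_relevant[-1]
    best_index, best_distance = 0, -1.0
    for i, y in enumerate(cumulative_relevant):
        # Perpendicular distance to the chord, up to a constant factor
        # (the chord length), which does not affect the argmax.
        distance = abs((y1 - y0) * (i - x0) - (x1 - x0) * (y - y0))
        if distance > best_distance:
            best_index, best_distance = i, distance
    return best_index
```

For a curve that rises steeply and then flattens, such as `[0, 10, 20, 30, 40, 41, 42, 43, 44, 45]`, this heuristic selects the point where the flattening begins. Because it needs the full curve, a practical procedure would have to apply it incrementally as review proceeds.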