Haotian Zhang, PhD candidate
David R. Cheriton School of Computer Science
This paper tackles the challenge of accurately and efficiently estimating the number of relevant documents in a collection for a particular topic. One real-world application is estimating the volume of social media posts (e.g., tweets) pertaining to a topic, which is fundamental to tracking the popularity of politicians and brands, the potential sales of a product, etc. Our insight is to leverage active learning techniques to find all the “easy” documents, and then to use sampling techniques to infer the number of relevant documents in the residual collection.
We propose a simple yet effective technique for determining the “switchover” point between the active learning phase and the sampling phase, which can be understood intuitively as the “knee” in an effort vs. recall gain curve, and we explore alternative sampling strategies beyond the knee. Experiments on several TREC datasets and a collection of tweets show that our best technique yields more accurate estimates, for the same effort, than several alternatives.
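To make the two-phase idea concrete, the following is a minimal sketch and not the method evaluated in the paper. It assumes a generic knee heuristic (the point on the cumulative-found curve farthest above the chord joining its endpoints) and a uniform random sample of the residual collection. The names `find_knee`, `estimate_relevant_total`, and `is_relevant` are hypothetical and introduced only for illustration.

```python
import random


def find_knee(cumulative_found):
    """Return the index of the point farthest above the chord that joins
    the first and last points of the curve.

    Assumption: this is a generic knee heuristic, not the paper's technique.
    cumulative_found[i] is the number of relevant documents found after
    i units of review effort.
    """
    n = len(cumulative_found)
    if n < 3:
        return n - 1
    y0, y1 = cumulative_found[0], cumulative_found[-1]
    best_i, best_gap = 0, float("-inf")
    for i, y in enumerate(cumulative_found):
        chord_y = y0 + (y1 - y0) * i / (n - 1)
        gap = y - chord_y
        if gap > best_gap:
            best_i, best_gap = i, gap
    return best_i


def estimate_relevant_total(found_relevant, residual_ids, is_relevant,
                            sample_size, seed=0):
    """Estimate the total number of relevant documents as
    found + |residual| * (fraction of a uniform sample judged relevant).

    Assumption: is_relevant stands in for a human assessor's judgment.
    """
    rng = random.Random(seed)
    n = min(sample_size, len(residual_ids))
    if n == 0:
        return float(found_relevant)
    sample = rng.sample(residual_ids, n)
    hits = sum(1 for doc_id in sample if is_relevant(doc_id))
    return found_relevant + len(residual_ids) * hits / n


if __name__ == "__main__":
    # Toy effort vs. recall curve: gains flatten after the third batch.
    curve = [0, 40, 70, 85, 90, 92, 93, 94]
    knee = find_knee(curve)
    residual = list(range(1000))  # toy residual collection
    total = estimate_relevant_total(
        found_relevant=curve[knee],
        residual_ids=residual,
        is_relevant=lambda doc_id: doc_id % 10 == 0,  # toy judgments
        sample_size=200,
    )
    print(knee, total)
```

In this toy setup, review effort stops at the knee, and the remaining relevant documents are inferred from the sample rather than found one by one. A larger `sample_size` reduces the variance of the estimate at the cost of more assessment effort.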
200 University Avenue West
Waterloo, ON N2L 3G1
Canada