Zeynep Akkalyoncu Yilmaz, Master’s candidate
David R. Cheriton School of Computer Science
Standard bag-of-words term-matching techniques in document retrieval fail to exploit the rich semantic information embedded in document texts. One promising recent trend in facilitating context-aware semantic matching has been the development of massively pretrained language models, of which BERT is the most popular example today.
In this work, we propose adapting BERT as a neural re-ranker for document retrieval, achieving large improvements on news articles. Two fundamental issues arise in applying BERT to “ad hoc” document retrieval on newswire collections: relevance judgments in existing test collections are provided only at the document level, and documents often exceed the length that BERT was designed to handle. To overcome these challenges, we compute and aggregate sentence-level evidence to rank documents. We address the lack of appropriate relevance judgments in test collections by leveraging sentence-level and passage-level relevance judgments fortuitously available in collections from other domains to capture cross-domain notions of relevance.
Our experiments demonstrate that models of relevance can be transferred across domains. We propose a model that leverages semantic cues learned across various domains and achieves state-of-the-art results on three standard TREC newswire collections. We explore the effects of cross-domain relevance transfer, and the trade-offs between using document and sentence scores for document ranking. We also present an end-to-end document retrieval system that integrates the open-source Anserini information retrieval toolkit, and we discuss the related technical challenges and design decisions.
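
As an illustration of how sentence-level evidence might be aggregated with a document score, the following is a minimal sketch and not necessarily the exact formulation used in this work. It assumes that each candidate document already has a first-stage retrieval score and a list of per-sentence relevance scores (for example, from a BERT re-ranker). The names `aggregate_score` and `rerank`, the interpolation parameter `alpha`, and the per-rank `weights` are hypothetical choices made for this example.

```python
from typing import Dict, List, Sequence, Tuple


def aggregate_score(
    doc_score: float,
    sentence_scores: Sequence[float],
    weights: Sequence[float],
    alpha: float,
) -> float:
    """Combine a document score with its top-scoring sentences.

    Assumed form (illustrative only): a linear interpolation between the
    document score and a weighted sum of the top len(weights) sentence scores.
    """
    # Keep only the highest-scoring sentences, one per weight.
    top = sorted(sentence_scores, reverse=True)[: len(weights)]
    # zip stops at the shorter sequence, so short documents are handled.
    sentence_evidence = sum(w * s for w, s in zip(weights, top))
    return alpha * doc_score + (1.0 - alpha) * sentence_evidence


def rerank(
    candidates: Dict[str, Tuple[float, List[float]]],
    weights: Sequence[float],
    alpha: float,
) -> List[str]:
    """Return document ids ordered by descending aggregated score."""
    scored = {
        doc_id: aggregate_score(doc_score, sent_scores, weights, alpha)
        for doc_id, (doc_score, sent_scores) in candidates.items()
    }
    return sorted(scored, key=scored.get, reverse=True)


if __name__ == "__main__":
    # Toy scores, invented purely for demonstration.
    candidates = {
        "d1": (1.0, [0.2, 0.1]),
        "d2": (0.8, [0.9, 0.7]),
    }
    print(rerank(candidates, weights=[1.0], alpha=0.5))
```

In this sketch, setting `alpha` to 1.0 ranks by document score alone, while lower values give more influence to the best-matching sentences, which is one way to examine the trade-off between document and sentence scores.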