PhD Seminar • Natural Language Processing • Efficient and Robust Neural Retrieval Methods with Pre-Trained Language Models

Wednesday, February 15, 2023 12:00 pm - 1:30 pm EST (GMT -05:00)

Please note: This PhD seminar will take place online.

Minghan Li, PhD candidate
David R. Cheriton School of Computer Science

Supervisor: Professor Jimmy Lin

TL;DR: This seminar will cover our EMNLP 2022 work on certified error control for relevance ranking, and our recent unpublished work, CITADEL, on efficient and robust multi-vector retrieval. We show that redundancy in information retrieval and relevance ranking can be eliminated in various ways to obtain neural retrievers that are both efficient and robust.


In information retrieval (IR), candidate set pruning is commonly used to speed up two-stage relevance ranking. However, such an approach lacks accurate error control and often trades accuracy for computational efficiency empirically, without theoretical guarantees. In the first work, we propose the concept of certified error control of candidate set pruning for relevance ranking, which means that the test error after pruning is guaranteed to stay below a user-specified threshold with high probability. Both in-domain and out-of-domain experiments show that our method successfully prunes the first-stage retrieved candidate sets to improve the second-stage reranking speed while satisfying the pre-specified accuracy constraints in both settings.
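As an illustration only, one generic way to obtain a guarantee of this kind is to calibrate the pruning cutoff on held-out queries with a concentration bound. The sketch below is not necessarily the procedure presented in the seminar: it assumes per-query losses in [0, 1], calibration queries drawn i.i.d. from the test distribution, and a Hoeffding bound with a union bound over the candidate cutoffs. The names `certified_cutoff`, `calib_losses`, `alpha`, and `delta` are hypothetical.

```python
import math

def certified_cutoff(calib_losses, alpha, delta):
    """Pick the smallest candidate-set size whose loss is certified <= alpha.

    calib_losses: dict mapping a cutoff k to a list of per-query losses in
                  [0, 1], measured on held-out queries after pruning to size k
                  (hypothetical input format).
    alpha:        user-specified error threshold.
    delta:        allowed failure probability of the guarantee.
    Returns the smallest qualifying k, or None if no cutoff can be certified.
    """
    num_options = len(calib_losses)
    for k in sorted(calib_losses):
        losses = calib_losses[k]
        n = len(losses)
        # One-sided Hoeffding margin, with a union bound over all cutoffs,
        # so the bound holds for every k simultaneously w.p. >= 1 - delta.
        margin = math.sqrt(math.log(num_options / delta) / (2 * n))
        if sum(losses) / n + margin <= alpha:
            return k
    return None
```

Under these assumptions, a cutoff returned by `certified_cutoff` has expected loss at most `alpha` with probability at least `1 - delta` over the draw of the calibration set; a return value of `None` signals that no cutoff could be certified at the requested threshold.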

Multi-vector retrieval methods combine the merits of sparse (e.g., BM25) and dense (e.g., DPR) retrievers and achieve state-of-the-art performance on various retrieval tasks. These methods, however, are orders of magnitude slower and require more index storage than their single-vector counterparts. In the second work, we unify different multi-vector retrieval models from a token routing viewpoint and propose conditional token interaction via dynamic lexical routing, namely CITADEL, for efficient and effective multi-vector retrieval. CITADEL learns to route each token vector to the predicted lexical “keys” such that a query token vector only interacts with document token vectors routed to the same key. This design significantly reduces the computation cost while maintaining high accuracy.
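The routing idea can be sketched in a few lines. This is a minimal illustration, not the CITADEL implementation: the router scores (`key_logits`) are taken as given rather than learned, and the aggregation (maximum similarity per query token, summed over keys) is an assumption borrowed from late-interaction scoring. The function names `route` and `score` are hypothetical.

```python
import numpy as np

def route(token_vecs, key_logits, top_k):
    """Group token vectors by the lexical keys they are routed to.

    token_vecs: array of shape (num_tokens, dim).
    key_logits: array of shape (num_tokens, num_keys), hypothetical router
                scores; each token goes to its top_k highest-scoring keys.
    Returns a dict mapping key id -> list of token vectors.
    """
    routed = {}
    for vec, logits in zip(token_vecs, key_logits):
        for key in np.argsort(-logits)[:top_k]:
            routed.setdefault(int(key), []).append(vec)
    return routed

def score(query_routes, doc_routes):
    """Score a document by interacting only tokens that share a key."""
    total = 0.0
    for key, q_vecs in query_routes.items():
        d_vecs = doc_routes.get(key)
        if d_vecs is None:
            continue  # no document token under this key: no interaction
        sims = np.stack(q_vecs) @ np.stack(d_vecs).T
        # Assumed aggregation: best-matching document token per query token.
        total += sims.max(axis=1).sum()
    return float(total)
```

Because similarities are computed only within shared keys, the number of dot products depends on how many tokens land under the same key rather than on the full query-by-document token grid, which is where the reduction in computation comes from.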


To join this PhD seminar on Zoom, please go to https://uwaterloo.zoom.us/j/96169405889.