Please note: This PhD seminar will take place online.
Nandan Thakur, PhD candidate
David R. Cheriton School of Computer Science
Supervisor: Professor Jimmy Lin
Benchmarks are essential for measuring realistic progress in information retrieval. However, existing benchmarks saturate quickly because models overfit to them, which obscures how well retrieval models generalize. To address these challenges, I will present two of my research efforts: BEIR, a heterogeneous benchmark for zero-shot evaluation across specialized domains, and MIRACL, a benchmark for monolingual retrieval across a diverse range of languages.
In BEIR, we show that neural retrievers surprisingly struggle to generalize zero-shot to specialized domains due to a lack of in-domain training data. To overcome this, we develop GPL, which distills cross-encoder knowledge using synthetic data generated across the BEIR domains. On the language side, MIRACL provides robust annotations and broad language coverage. However, generating supervised training data remains cumbersome in realistic settings. To fill this gap, we construct SWIM-IR, a synthetic training dataset of 28 million LLM-generated pairs across 37 languages. Multilingual retrievers trained on SWIM-IR are comparable to supervised models on three multilingual retrieval benchmarks, and the approach can be extended to several new languages.
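For readers unfamiliar with the synthetic-data recipe that both GPL and SWIM-IR build on, the sketch below is an illustrative example (not the speaker's code), assuming a publicly available doc2query-style query generator: a generative model writes a query for each passage, and the resulting (query, passage) pairs serve as training data for a dense retriever.

```python
# Minimal sketch of synthetic training-pair generation for dense retrieval.
# The model name and passages are illustrative assumptions, not from the talk.
from transformers import pipeline

generator = pipeline(
    "text2text-generation",
    model="doc2query/msmarco-t5-base-v1",  # assumed query-generation checkpoint
)

passages = [
    "BEIR is a heterogeneous benchmark for zero-shot evaluation of retrieval models.",
    "MIRACL covers monolingual retrieval across a diverse set of languages.",
]

training_pairs = []
for passage in passages:
    # Generate one synthetic query per passage; sampling adds query diversity.
    query = generator(passage, max_new_tokens=32, do_sample=True, top_k=10)[0][
        "generated_text"
    ]
    training_pairs.append({"query": query, "positive_passage": passage})

for pair in training_pairs:
    print(pair)
```

In GPL the pairs are additionally scored by a cross-encoder so the dense retriever can be trained by distillation; in SWIM-IR the generator is a large language model prompted to produce queries in many languages.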