Please note: This PhD seminar will take place online.
Nandan Thakur, PhD candidate
David R. Cheriton School of Computer Science
Supervisor: Professor Jimmy Lin
Benchmarks are essential for measuring realistic progress in information retrieval. However, existing benchmarks saturate quickly because models overfit to them, which obscures how well retrieval models generalize. To address these challenges, I will present two of my research efforts: BEIR, a heterogeneous benchmark for zero-shot evaluation across specialized domains, and MIRACL, a benchmark for monolingual retrieval across a diverse range of languages.
In BEIR, we show that neural retrievers surprisingly struggle to generalize zero-shot to specialized domains due to a lack of in-domain training data. To overcome this, we develop GPL, which distills cross-encoder knowledge using synthetic data generated across the BEIR domains. On the language side, MIRACL provides robust annotations and broad language coverage. However, generating supervised training data remains cumbersome in realistic settings. To fill this gap, we construct SWIM-IR, a synthetic training dataset of 28 million LLM-generated pairs across 37 languages. Multilingual retrievers trained on SWIM-IR are comparable to supervised models on three multilingual retrieval benchmarks, and the approach can be extended to several new languages.
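For readers unfamiliar with the synthetic-data recipe that both GPL and SWIM-IR build on, the sketch below is an illustrative example (not the speaker's code), assuming a publicly available doc2query-style query generator: a generative model writes a query for each passage, and the resulting (query, passage) pairs serve as training data for a dense retriever.

```python
# Minimal sketch of synthetic training-pair generation for dense retrieval.
# The model name and passages are illustrative assumptions, not from the talk.
from transformers import pipeline

generator = pipeline(
    "text2text-generation",
    model="doc2query/msmarco-t5-base-v1",  # assumed query-generation checkpoint
)

passages = [
    "BEIR is a heterogeneous benchmark for zero-shot evaluation of retrieval models.",
    "MIRACL covers monolingual retrieval across a diverse set of languages.",
]

training_pairs = []
for passage in passages:
    # Generate one synthetic query per passage; sampling adds query diversity.
    query = generator(passage, max_new_tokens=32, do_sample=True, top_k=10)[0][
        "generated_text"
    ]
    training_pairs.append({"query": query, "positive_passage": passage})

for pair in training_pairs:
    print(pair)
```

In GPL the pairs are additionally scored by a cross-encoder so the dense retriever can be trained by distillation; in SWIM-IR the generator is a large language model prompted to produce queries in many languages.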