Please note: This PhD defence will take place in DC 2310 and online.
Nandan Thakur, PhD candidate
David R. Cheriton School of Computer Science
Supervisor: Professor Jimmy Lin
Modern applications increasingly rely on retrieval models and large language models (LLMs) for knowledge-intensive tasks such as information retrieval and question answering. LLMs are limited by their parametric knowledge and prone to hallucinations; retrieval-augmented generation (RAG) mitigates these failures by grounding generation in external documents. While pretrained transformers achieve strong in-domain performance, their zero-shot robustness across heterogeneous domains, evolving corpora, and diverse languages remains limited. This gap is largely driven by deficiencies in current benchmarks, training data, and evaluation methodologies, which fail to reflect real-world distribution shifts.
This thesis addresses these limitations along three axes: retrieval benchmarks, training data, and RAG evaluation. First, on benchmarks, it revisits Touché 2020, the argument-retrieval subset of the BEIR benchmark, and curates the test collection through denoising and post-hoc relevance assessment. It then introduces MIRACL, a large-scale multilingual retrieval dataset spanning 18 languages, and proposes FreshStack, an automated framework for building realistic, up-to-date retrieval benchmarks for technical and niche domains. On the training data side, it presents SWIM-IR, a large synthetic multilingual dataset covering 33 languages, and shows that training on synthetic data alone can match or exceed supervised performance; it further demonstrates that dataset pruning and LLM-based relabeling (RLHN) consistently improve the out-of-domain robustness of retrievers.
Finally, on RAG evaluation, the thesis introduces the Ragnarök framework and baselines for the TREC 2024 RAG Track, validates LLM-based judges for large-scale support assessment, and releases NoMIRACL for human-annotated multilingual RAG evaluation. It further proposes MIRAGE-Bench, an arena-style multilingual benchmark that trains a surrogate judge from heuristic signals and LLM preferences, achieving high agreement with LLM-based rankings at much lower cost. Overall, this work shows that robust progress in retrieval and RAG depends not only on model scaling but, critically, on realistic benchmarks, high-quality training data, and principled, scalable evaluation frameworks.
To attend this PhD defence in person, please go to DC 2310. You can also attend virtually on Zoom.