Please note: This master’s thesis presentation will take place in DC 2314.
Shivani Upadhyay, Master’s candidate
David R. Cheriton School of Computer Science
Supervisor: Professor Jimmy Lin
Evaluating information retrieval (IR) systems requires a reference that captures what correct or relevant output looks like, as well as a mechanism for determining whether a system’s output matches that reference. For lexical retrieval systems, both requirements are relatively straightforward. Systems rank documents by term overlap, pooling produces a judgment file that covers most documents any system is likely to return, and determining relevance reduces to a simple membership test against that file.
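As a rough illustration of the membership test described above, the following sketch scores a ranked list against a judgment file. The dictionary layout, the identifiers, and the helper names `is_relevant` and `precision_at_k` are assumptions made for this example, not part of the thesis; note that any document absent from the file is treated as non-relevant.

```python
# Hypothetical judgment file contents: (query_id, doc_id) -> relevance grade.
# The entries are invented for illustration only.
qrels = {
    ("q1", "d1"): 1,
    ("q1", "d2"): 0,
    ("q1", "d3"): 1,
}


def is_relevant(qrels, query_id, doc_id):
    """Membership test: a document counts as relevant only if it was judged
    with a positive grade. Unjudged pairs fall back to grade 0."""
    return qrels.get((query_id, doc_id), 0) > 0


def precision_at_k(qrels, query_id, ranked_doc_ids, k):
    """Fraction of the top-k returned documents that pass the membership test."""
    top_k = ranked_doc_ids[:k]
    hits = sum(is_relevant(qrels, query_id, doc_id) for doc_id in top_k)
    return hits / k


run = ["d1", "d2", "d3"]  # hypothetical system output for query q1
p_at_3 = precision_at_k(qrels, "q1", run, 3)
```

Here two of the three returned documents carry a positive judgment, so `p_at_3` is 2/3.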
This evaluation paradigm relies on the assumption that relevance can be detected through surface-form overlap. When retrieval moves beyond that assumption, the framework begins to break down. Retrieval-augmented generation (RAG) systems strain this setup by synthesizing free-form natural language responses from retrieved evidence. A gold answer set constructed before system execution cannot anticipate every correct phrasing, so even semantically correct outputs can fail under lexical matching. Dense retrieval systems encode queries and documents as vectors, retrieving relevant documents that might not share vocabulary with the query. Under pooling-based evaluation, these documents never receive human judgments and are instead assigned a default relevance grade of zero. Together, these failures highlight the limits of surface-form evaluation and point to the need for judgment mechanisms that reason directly about meaning.
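The first failure mode can be made concrete with a small sketch of lexical answer matching. The normalization steps and the containment check below are a common heuristic assumed for illustration, not the evaluation procedure used in the thesis, and the gold answer and responses are invented.

```python
import re
import string


def normalize(text):
    """Lowercase, strip punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def lexical_match(response, gold_answers):
    """True if any normalized gold answer appears verbatim in the response."""
    normalized_response = normalize(response)
    return any(normalize(gold) in normalized_response for gold in gold_answers)


# Invented example: a gold answer set fixed before any system is run.
gold = ["Neil Armstrong"]
verbatim = "The first person on the Moon was Neil Armstrong."
paraphrase = "Armstrong, the Apollo 11 commander, was the first to walk on the Moon."

verbatim_ok = lexical_match(verbatim, gold)      # gold string appears as written
paraphrase_ok = lexical_match(paraphrase, gold)  # same answer, different wording
```

Both responses name the same person, yet only the first passes, because the paraphrase never contains the exact gold string.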
This thesis investigates whether large language models (LLMs) can fill this gap by contributing three frameworks across successive layers of the evaluation pipeline. The first contribution is an open-source QA evaluation framework that combines chain-of-thought (CoT) prompting with self-consistency decoding using instruction-tuned LLMs. When evaluated across 12 systems on NQ-open, it matches zero-shot GPT-4 in rank correlation with human judgments while using a model more than an order of magnitude smaller, demonstrating that prompting strategy can matter as much as scale. The second contribution is a framework for patching incomplete relevance judgment sets by assigning four-level TREC-style labels to unjudged query-passage pairs via few-shot prompting. When evaluated across five TREC Deep Learning Track collections at removal rates ranging from 10% to 90%, it substantially improves system ranking fidelity over the standard practice of treating unjudged documents as non-relevant. The third contribution is UMBRELA, a fully automated open-source relevance assessment framework deployed in the TREC 2024 RAG Track across 301 topics, achieving a run-level Kendall’s tau of at least 0.86 against fully manual assessment.
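Two of the mechanisms named above can be sketched briefly: the majority vote at the core of self-consistency decoding, and Kendall's tau as a measure of agreement between two system rankings. This is a minimal sketch under stated assumptions: the function names are hypothetical, the tau shown is the simple tau-a variant, and every verdict and score below is invented for illustration rather than taken from the thesis.

```python
from collections import Counter
from itertools import combinations


def self_consistent_verdict(sampled_verdicts):
    """Majority vote over final verdicts from several sampled reasoning chains.
    Ties go to the verdict that was sampled first."""
    return Counter(sampled_verdicts).most_common(1)[0][0]


def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a over systems scored under two evaluation conditions:
    (concordant pairs - discordant pairs) / total pairs."""
    systems = sorted(scores_a)
    concordant = discordant = 0
    for s, t in combinations(systems, 2):
        product = (scores_a[s] - scores_a[t]) * (scores_b[s] - scores_b[t])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    n_pairs = len(systems) * (len(systems) - 1) // 2
    return (concordant - discordant) / n_pairs


# Invented verdicts from five sampled chains for one question-answer pair.
verdict = self_consistent_verdict(
    ["correct", "incorrect", "correct", "correct", "incorrect"]
)

# Invented per-system scores under manual and automatic judgments.
manual = {"sysA": 0.62, "sysB": 0.55, "sysC": 0.48, "sysD": 0.40}
automatic = {"sysA": 0.70, "sysB": 0.66, "sysC": 0.51, "sysD": 0.58}
tau = kendall_tau(manual, automatic)
```

In this toy case the vote returns "correct", and the two score sets disagree on one of six system pairs (sysC versus sysD), giving a tau of about 0.67. A value near 1 indicates that the automatic judgments rank systems almost exactly as the manual judgments do.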
All frameworks are released as open-source tools to support reproducible and scalable IR evaluation.