PhD Seminar • Information Retrieval • Enhancing Zero-Shot Text Retrieval with Large Language Models

Wednesday, January 17, 2024 12:30 pm - 1:30 pm EST (GMT -05:00)

Please note: This PhD seminar will take place in DC 1304.

Xueguang Ma, PhD candidate
David R. Cheriton School of Computer Science

Supervisor: Professor Jimmy Lin

Neural retrieval systems have proven effective across a range of tasks and languages. However, building a fully zero-shot neural retrieval pipeline remains challenging when relevance labels are not available.

In this presentation, I will introduce two of our works, Hypothetical Document Embeddings (HyDE) and Listwise Reranker with a Large Language Model (LRL), which leverage large language models to enhance text retrieval without the need for human relevance judgements.

HyDE uses a large language model to generate ‘hypothetical’ documents for a given query. These documents capture relevance patterns, but they are not real and may contain hallucinations. Each hypothetical document is then encoded into an embedding vector by an unsupervised dense retriever, such as Contriever. This vector identifies a neighbourhood in the corpus embedding space, from which similar real documents are retrieved. HyDE significantly outperforms Contriever, the state-of-the-art unsupervised dense retriever, and demonstrates effectiveness comparable to supervised dense retrievers.

LRL, in turn, introduces a listwise reranking paradigm in which a large language model is prompted to generate a reordered list of document identifiers for a given set of candidate documents. LRL not only outperforms zero-shot pointwise methods when reranking first-stage retrieval results, but can also serve as a final-stage reranker that improves the top-ranked results of a pointwise method. Experiments on web search and multilingual information retrieval datasets show the effectiveness of both proposed methods. Minimal code sketches of the two ideas follow below.
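To make the HyDE pipeline above concrete, here is a minimal sketch. It assumes PyTorch with the Hugging Face facebook/contriever checkpoint; the generate() callable is a hypothetical stand-in for any large language model API, and the prompt wording and the choice of eight hypothetical documents are assumptions for the sketch, not the paper's exact settings.

```python
# Minimal HyDE-style retrieval sketch. generate() is a hypothetical
# stand-in for an LLM call; prompt wording and n_docs are illustrative.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("facebook/contriever")
encoder = AutoModel.from_pretrained("facebook/contriever")

def mean_pool(last_hidden, attention_mask):
    # Contriever uses mean pooling over token embeddings.
    mask = attention_mask.unsqueeze(-1).float()
    return (last_hidden * mask).sum(dim=1) / mask.sum(dim=1)

def embed(texts):
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = encoder(**batch)
    return mean_pool(out.last_hidden_state, batch["attention_mask"])

def hyde_search(query, corpus_embeddings, generate, n_docs=8, k=10):
    # 1. Ask an LLM to write passages that would answer the query.
    #    These passages are "hypothetical": plausible but possibly hallucinated.
    hypo_docs = [generate(f"Write a passage that answers the question: {query}")
                 for _ in range(n_docs)]
    # 2. Encode the hypothetical documents (and the query itself) with the
    #    unsupervised encoder, then average them into a single search vector.
    vec = embed(hypo_docs + [query]).mean(dim=0)
    # 3. Retrieve the real documents nearest to that vector by inner product
    #    over precomputed corpus embeddings of shape (num_docs, dim).
    scores = corpus_embeddings @ vec
    return scores.topk(k).indices.tolist()
```

The key design point is that the supervised relevance-modelling step is offloaded to the generative model, so the dense encoder never needs labelled query-document pairs.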
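Similarly, here is a minimal sketch of the listwise reranking idea behind LRL. The llm() callable is a hypothetical stand-in for a large language model API, and the prompt wording is illustrative rather than the exact prompt used in the work.

```python
# Minimal listwise-reranking sketch in the spirit of LRL. llm() is a
# hypothetical stand-in for an LLM call; the prompt is illustrative.
import re

def listwise_rerank(query, candidates, llm):
    # Present every candidate passage with a bracketed identifier.
    passages = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(candidates))
    prompt = (
        f"The following passages are candidates for the query: {query}\n\n"
        f"{passages}\n\n"
        "Rank the passages from most to least relevant to the query. "
        "Answer only with passage identifiers, e.g. [2] > [1] > [3]."
    )
    reply = llm(prompt)
    # Parse the identifiers in the order the model produced them.
    order = [int(m) - 1 for m in re.findall(r"\[(\d+)\]", reply)]
    # Keep the model's ordering, dropping duplicates and out-of-range ids,
    # then append any candidates the model omitted.
    seen, ranking = set(), []
    for i in order + list(range(len(candidates))):
        if 0 <= i < len(candidates) and i not in seen:
            seen.add(i)
            ranking.append(i)
    return [candidates[i] for i in ranking]
```

Unlike a pointwise method, which scores each document independently, this formulation lets the model compare all candidates against one another in a single prompt before committing to an ordering.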