Please note: This PhD seminar will take place in DC 3301.
Xueguang Ma, PhD candidate
David R. Cheriton School of Computer Science
Supervisor: Professor Jimmy Lin
Information seeking has been fundamental to human advancement, enabling knowledge acquisition, decision-making, and innovation across disciplines. However, traditional information retrieval systems often rely on specialized pipelines optimized for specific retrieval tasks, causing information silos that hinder unified information seeking.
In this talk, I will present our work in building unified document retrieval systems that break these information silos across three dimensions: (1) domain and language silos, where I demonstrate how LLM-based dense retrievers achieve strong generalizability across retrieval tasks and present frameworks for training small, generalizable retrievers through diverse LLM augmentation; (2) modality silos, where I introduce a paradigm shift away from text-based retrieval, which relies on content extraction, toward directly encoding document screenshots, which preserves all information, including text, images, and layout, in unified dense representations; and (3) space silos, where I show the importance of LLM-powered search agents in seeking and gathering information across disparate sources, and present fair and transparent evaluation benchmarks for assessing deep-search systems. I will conclude by discussing future directions that pave the way toward truly unified retrieval systems for seamless information seeking across world knowledge.