PhD Defence • Artificial Intelligence | Machine Learning • Improving OOD Detection, Classification, and Reasoning via Multi-modal Feature Alignment

Friday, May 8, 2026 1:00 pm - 4:00 pm EDT (GMT -04:00)

Please note: This PhD defence will take place online.

Yimu Wang, PhD candidate
David R. Cheriton School of Computer Science

Supervisor: Professor Krzysztof Czarnecki

The deployment of artificial intelligence (AI) systems in real-world scenarios, such as autonomous vehicles encountering novel road conditions and medical AI analyzing rare pathologies, requires robust handling of out-of-distribution (OOD) data: inputs that differ from the training distribution due to semantic shift (e.g., novel object categories) or covariate shift (e.g., changes in lighting or sensor noise). Achieving this robustness requires that AI systems progress from detecting OOD data, to classifying novel categories within it, and ultimately to reasoning about OOD scenarios to answer user queries, a critical capability for safety-critical applications where understanding unfamiliar situations is essential. These tasks demand increasingly advanced vision-language capabilities, yet current models face technical barriers in multi-modal feature alignment that limit their practical deployment, and the challenges and applications of multi-modal feature alignment for OOD detection, classification, and reasoning remain underexplored. This thesis makes three key contributions to advance the understanding and application of multi-modal feature alignment in vision-language models (VLMs).

First, we address the limitations of VLMs in OOD detection. We observe that the modality gap between image and text features causes high false positive rates, as OOD samples can exhibit high similarity to in-distribution (ID) text prototypes. To overcome this limitation, we propose a novel few-shot OOD detection method that incorporates ID image prototypes alongside ID text prototypes. Our method introduces the Bias Prompt Generation module to enhance image-text fusion and the Image-Text Contrastive module to reduce the modality gap. This multi-modal prototype approach significantly improves OOD detection accuracy across multiple benchmarks.
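As a rough illustration of the multi-modal prototype idea, the sketch below scores a test image against both ID text prototypes and ID image prototypes. It is a minimal sketch assuming MCM-style softmax-scaled maximum similarity; the function name ood_score, the temperature tau, and the blending weight alpha are illustrative choices, not details taken from the thesis.

import torch
import torch.nn.functional as F

def ood_score(image_feat, text_protos, image_protos, tau=0.01, alpha=0.5):
    """Score a test image against ID text AND image prototypes.

    A higher score suggests the input is in-distribution; thresholding
    the score yields the OOD decision. All tensors are assumed to be
    L2-normalized CLIP-style embeddings.

    image_feat:   (d,)     test image embedding
    text_protos:  (C, d)   one text prototype per ID class
    image_protos: (C, d)   one image prototype per ID class (few-shot mean)
    """
    sim_text = image_feat @ text_protos.T    # (C,) cosine similarities
    sim_image = image_feat @ image_protos.T  # (C,)
    # Softmax-scaled maximum similarity (MCM-style), per modality
    p_text = F.softmax(sim_text / tau, dim=-1).max()
    p_image = F.softmax(sim_image / tau, dim=-1).max()
    # Blend both modalities: text similarity alone suffers from the
    # modality gap, so image prototypes provide a complementary check.
    return alpha * p_text + (1 - alpha) * p_image

# Example with 3 ID classes and 512-dim features (random placeholders)
feat = F.normalize(torch.randn(512), dim=-1)
tp = F.normalize(torch.randn(3, 512), dim=-1)
ip = F.normalize(torch.randn(3, 512), dim=-1)
score = ood_score(feat, tp, ip)  # threshold to decide ID vs. OOD

Relying on both modalities means an OOD sample must spuriously match a class in text space and image space to be accepted, which is the intuition behind the lower false positive rate.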

Next, we tackle OOD classification through 3D open-vocabulary semantic segmentation, which leverages VLMs to generate point-wise classification results for novel object categories. Due to the lack of large-scale 3D-language data, current methods distill knowledge from pre-trained 2D VLMs into 3D models. However, this distillation is supervised by misaligned 3D-scene-image-to-text data pairs, leading to suboptimal performance. To address this issue, we propose an aligned 3D open-vocabulary semantic segmentation framework with two novel modules: a CLIP-Rewarded Alignment Module that generates high-quality, well-aligned 3D-scene-image-to-text pairs through temperature-based generation and CLIP-rewarded sampling, and an Adaptive Segmentation Module that introduces trainable tokens within the text encoder to adapt it to 3D contexts. This approach significantly outperforms previous methods on representative benchmarks.
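The CLIP-rewarded sampling step can be pictured with a minimal sketch: draw candidate captions at several temperatures and keep the one that CLIP scores highest against the rendered scene image. The generate_caption and clip_similarity callables are assumed interfaces standing in for the actual captioner and reward model, not APIs from the thesis.

def clip_rewarded_caption(image, generate_caption, clip_similarity,
                          temperatures=(0.3, 0.7, 1.0), n_samples=4):
    """Select the best-aligned caption for a rendered 3D-scene image.

    generate_caption(image, temperature) -> str  # assumed captioner
    clip_similarity(image, caption) -> float     # assumed CLIP reward

    Candidates are drawn at several temperatures; the CLIP image-text
    similarity acts as the reward that picks the best-aligned pair.
    """
    best_caption, best_reward = None, float("-inf")
    for t in temperatures:
        for _ in range(n_samples):
            caption = generate_caption(image, temperature=t)
            reward = clip_similarity(image, caption)
            if reward > best_reward:
                best_caption, best_reward = caption, reward
    return best_caption, best_reward

Sweeping temperatures trades diversity against fidelity: low temperatures yield safe but generic captions, while higher ones occasionally produce the more specific descriptions that the CLIP reward then selects.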

Finally, we explore efficient multi-modal feature alignment for OOD reasoning. In real-world applications such as autonomous driving and medical diagnosis, AI systems must not only detect and classify OOD data but also reason about appropriate responses: for instance, determining safe actions when encountering an unexpected obstacle or providing diagnostic insights for rare pathologies. Multi-modal large language models (MLLMs, also known as generative VLMs) offer strong generalization for such reasoning tasks, but their substantial computational requirements limit practical deployment. To address this, we propose an efficient MLLM with two novel components: a conditional token reduction module that consolidates visual tokens based on their similarity to text tokens and learnable queries, and a mixture of multi-modal experts module whose router takes both text and visual tokens as input to switch more effectively among low-rank adaptation (LoRA) experts. The proposed method achieves competitive performance while using significantly fewer visual tokens, enabling efficient OOD reasoning without sacrificing effectiveness.
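One plausible reading of the conditional token reduction step is sketched below: visual tokens most relevant to the text query are kept verbatim, while the remainder are attention-pooled into a small set of learnable queries. The function and parameter names (reduce_visual_tokens, keep_ratio) are hypothetical, and the actual module may differ.

import torch
import torch.nn.functional as F

def reduce_visual_tokens(vis, txt, queries, keep_ratio=0.25):
    """Conditionally reduce visual tokens before they reach the LLM.

    vis:     (Nv, d) visual tokens
    txt:     (Nt, d) text tokens from the user query
    queries: (Nq, d) learnable queries that absorb the pruned tokens

    Returns (k + Nq, d) tokens, far fewer than the original Nv.
    """
    # Text-conditioned relevance: max cosine similarity to any text token
    rel = (F.normalize(vis, dim=-1) @ F.normalize(txt, dim=-1).T).max(dim=-1).values
    k = max(1, int(keep_ratio * vis.size(0)))
    keep_idx = rel.topk(k).indices
    kept = vis[keep_idx]

    # Remaining tokens are consolidated rather than discarded outright
    mask = torch.ones(vis.size(0), dtype=torch.bool)
    mask[keep_idx] = False
    rest = vis[mask]

    # Attention-pool the pruned tokens into the learnable queries
    attn = F.softmax(queries @ rest.T / rest.size(-1) ** 0.5, dim=-1)
    summary = attn @ rest  # (Nq, d)
    return torch.cat([kept, summary], dim=0)

Consolidating rather than simply dropping the low-relevance tokens is what lets the token count shrink without discarding visual context the reasoning step may still need.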

This thesis demonstrates that systematic improvements in multi-modal feature alignment can address multiple complex OOD challenges, from detection through classification to reasoning. These contributions lay a foundation for deploying AI in open-world environments, enabling more reliable and scalable systems that handle novel scenarios robustly.


Attend this PhD defence virtually on MS Teams.