PhD Defence Notice: Xiaoyu Xu

Thursday, December 12, 2024 11:00 am - 1:00 pm EST (GMT -05:00)

Candidate: Xiaoyu Xu

Title: Perceptual Relationship and Representation Learning for 3D Understanding and Quality Enhancement

Date: December 12, 2024

Time: 11:00 AM

Place: REMOTE ATTENDANCE

Supervisor(s): Wang, Zhou

Abstract:

Three-dimensional (3D) modeling plays a crucial role in a wide variety of real-world applications such as autonomous driving, smart cities, entertainment, education, and video game development. In the past decades, there has been a growing research effort focused on enhancing 3D understanding and improving the quality of 3D representation and rendering. Nevertheless, existing methods acquire 3D understanding by relying either on low-level features such as object sizes and edges, or on high-level semantics such as object categories, but they often overlook the relationships among multiple objects, which are essential for a comprehensive, human-like perceptual understanding of 3D scenes in a natural environment. Additionally, current deep learning-based methods for quality enhancement of 3D modeling tend to overfit through large-scale neural networks, which limits their efficiency and generalizability. In this thesis, we tackle the problem from several perspectives and propose four novel methods for learning perceptual relationships and efficient representations for 3D understanding and quality enhancement.

To extend 3D understanding, such as depth perception, from local cues like shapes and semantics to more global cues, we propose a “relationship spatialization” framework that extracts spatial information from perceptual relationships. The relationship spatialization has two key objectives: (1) identifying which perceptual relationships contribute to 3D understanding, and (2) quantifying their contributions. We then integrate the spatialized relationship representations into the monocular depth estimation task to evaluate their effectiveness. Experiments on the KITTI, NYU v2, and ICL-NUIM datasets demonstrate the effectiveness of relationship spatialization. Moreover, applying the framework to current state-of-the-art depth estimation models leads to marginal improvements on most evaluation metrics.

It is worth noting that perceptual relationships are view-dependent, meaning each relationship is recognized from a fixed viewpoint. This limitation restricts the ability to incorporate the rich 3D information carried by perceptual relationships into novel-view-related 3D tasks, such as novel view synthesis (NVS). To address this issue, we introduce a “visual relationship transformation” method, which predicts perceptual relationships for an unseen viewpoint, breaking the constraint of view dependency. The transformed relationship representations can then be incorporated into novel view synthesis to enhance 3D understanding.

Enhancing 3D understanding helps learn more accurate depth. We also aim to improve the perceptual quality of rendered novel views, which depends not only on 3D understanding but also on appearance, i.e., RGB values. Leveraging human perceptual preferences, which are sensitive to distortions in 2D images, we design a “human perceptual preference optimization” framework to improve the quality of 3D renderings. The framework imposes human perception as guidance for learning perceptually satisfactory representations; at the same time, human perception is formulated as a meta-learning objective function that regularizes the training process. Evaluation on the novel view synthesis task demonstrates the effectiveness of the proposed framework.

Although the above methods improve the quality of 3D modeling through perceptual representations, they overlook the efficiency and generalizability of the representations. To this end, we propose a “hierarchical controllable diffusion model” (HCDM), a latent diffusion model for learning 3D models represented as neural radiance fields or point clouds. The HCDM improves the quality of 3D modeling by representing 3D models as functional parameters that are far more compact than the 3D models themselves, allowing for an efficient representation. The HCDM is designed to learn the distribution of these parameters, improving generalizability across data from multiple modalities. The significant performance improvement of the proposed method is further validated on 2D images and 3D motion data.