PhD Seminar Notice: Perceptual Relationship and Representation Learning for 3D Understanding and Quality Enhancement


Candidate: Xiaoyu Xu

Date: Wednesday, November 6, 2024

Time: 9:30 AM - 10:30 AM EST

Location: EIT 3142   

Supervisor: Wang, Zhou

Abstract:
3D modeling plays a crucial role in industries such as autonomous driving, augmented reality, and virtual reality, as well as in academic fields like computer graphics. Increasing research effort focuses on enhancing 3D understanding and improving the quality of 3D renderings obtained from captured 2D images, making 3D modeling more applicable and efficient. However, existing works acquire 3D understanding by relying either on low-level information in 2D images, such as object shapes and edges, or on high-level semantics, such as object categories. They overlook the correlations among multiple objects, which are essential for a more comprehensive understanding of 3D information in a natural scene. Additionally, current deep learning-based methods for enhancing 3D modeling quality tend to rely on large-scale neural networks that overfit, which limits their efficiency and generalizability. In this thesis, we propose four novel methods that learn perceptual relationships and efficient representations to enhance 3D understanding and improve 3D modeling quality.

To extend 3D understanding, such as depth perception, from local cues like shapes and semantics to more global cues, we propose a “relationship spatialization” framework that extracts spatial information from perceptual relationships. Relationship spatialization has two key objectives: (1) identify which perceptual relationships contribute to 3D understanding, and (2) quantify their contributions. We then integrate the spatialized relationship representations into the monocular depth estimation task to evaluate their effectiveness, as sketched below.
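As a loose illustration of the fusion step, the following minimal PyTorch sketch injects spatialized relationship features into a depth head; the module names, feature dimensions, gating design, and fusion by concatenation are assumptions for illustration, not the thesis architecture:

```python
import torch
import torch.nn as nn

class RelationshipFusionDepthHead(nn.Module):
    """Toy depth head that fuses per-pixel image features with a pooled
    relationship embedding (hypothetical design for illustration)."""

    def __init__(self, img_dim=64, rel_dim=32):
        super().__init__()
        # Score each relationship embedding's contribution to depth.
        self.rel_gate = nn.Sequential(nn.Linear(rel_dim, 1), nn.Sigmoid())
        # Predict depth from concatenated image + relationship features.
        self.depth_head = nn.Sequential(
            nn.Conv2d(img_dim + rel_dim, 64, 3, padding=1),
            nn.ReLU(),
            nn.Conv2d(64, 1, 1),
        )

    def forward(self, img_feat, rel_embs):
        # img_feat: (B, C, H, W) backbone features
        # rel_embs: (B, R, D) one embedding per detected relationship
        gates = self.rel_gate(rel_embs)             # (B, R, 1) per-relationship weight
        rel_global = (gates * rel_embs).sum(dim=1)  # (B, D) weighted pooling
        B, D = rel_global.shape
        H, W = img_feat.shape[-2:]
        rel_map = rel_global.view(B, D, 1, 1).expand(B, D, H, W)
        return self.depth_head(torch.cat([img_feat, rel_map], dim=1))

# Usage with random tensors:
model = RelationshipFusionDepthHead()
depth = model(torch.randn(2, 64, 32, 32), torch.randn(2, 5, 32))
print(depth.shape)  # torch.Size([2, 1, 32, 32])
```

The sigmoid gate mirrors the two objectives above: it scores which relationship embeddings contribute (objective 1) and by how much (objective 2) before they are pooled and fused with the image features.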

However, perceptual relationships are view-dependent: each relationship is recognized from a fixed viewpoint. This restricts the ability to incorporate the rich 3D information carried by perceptual relationships into novel-view-related 3D tasks, such as novel view synthesis (NVS). To address this issue, we introduce a “visual relationship transformation” method that predicts the perceptual relationships seen from an unseen viewpoint, breaking the constraint of view dependency. The transformed relationship representations can then be incorporated into novel view synthesis to enhance 3D understanding.
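A minimal sketch of the transformation idea, assuming each relationship is encoded as a fixed-size embedding and the target viewpoint as a flattened relative camera pose (all names and dimensions below are hypothetical, not the thesis's model):

```python
import torch
import torch.nn as nn

class RelationshipTransformer(nn.Module):
    """Toy module mapping source-view relationship embeddings to a target
    viewpoint, conditioned on the relative camera pose (hypothetical)."""

    def __init__(self, rel_dim=32, pose_dim=12):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(rel_dim + pose_dim, 128),
            nn.ReLU(),
            nn.Linear(128, rel_dim),
        )

    def forward(self, rel_embs, rel_pose):
        # rel_embs: (B, R, D) relationships recognized from the source view
        # rel_pose: (B, 12) flattened 3x4 relative camera transform
        B, R, D = rel_embs.shape
        pose = rel_pose.unsqueeze(1).expand(B, R, rel_pose.shape[-1])
        # Predict what each relationship looks like from the target view.
        return self.net(torch.cat([rel_embs, pose], dim=-1))  # (B, R, D)

# Usage with random tensors:
transformed = RelationshipTransformer()(torch.randn(2, 5, 32), torch.randn(2, 12))
```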

Enhanced 3D understanding helps learn more accurate depth. We also want to improve the perceptual quality of rendered 2D images, which depends not only on 3D understanding but also on appearance, i.e., RGB values. Leveraging human perceptual preferences, which are sensitive to distortions in 2D images, we design a “human perceptual preference optimization” framework to improve the quality of 3D renderings. The framework uses human perception as guidance to learn perceptually satisfactory representations; at the same time, human perception is formulated as a meta-learning objective that regularizes training. The effectiveness of the framework is evaluated on the novel view synthesis task.
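The meta-learning formulation is beyond a short sketch, but the guidance idea can be illustrated as a perceptual regularization term added to a rendering loss; the function name, weighting, and toy metric below are assumptions, not the thesis's actual preference model:

```python
import torch
import torch.nn as nn

def preference_guided_loss(rendered, target, perceptual_metric, weight=0.1):
    """Combine pixel reconstruction with a perceptual-preference term.
    `perceptual_metric` is any differentiable model scoring perceptual
    distance (a hypothetical stand-in for a learned preference model)."""
    pixel_loss = nn.functional.mse_loss(rendered, target)
    percep_loss = perceptual_metric(rendered, target).mean()
    return pixel_loss + weight * percep_loss

# Toy stand-in metric: L1 distance on downsampled images
# (illustrative only; not a real human-preference model).
class ToyPerceptualMetric(nn.Module):
    def __init__(self):
        super().__init__()
        self.blur = nn.AvgPool2d(4)

    def forward(self, a, b):
        return (self.blur(a) - self.blur(b)).abs().flatten(1).mean(dim=1)

# Usage with random images:
loss = preference_guided_loss(
    torch.rand(2, 3, 64, 64), torch.rand(2, 3, 64, 64), ToyPerceptualMetric()
)
```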

Although the above methods improve the quality of 3D modeling through perceptual representations, they do not address the efficiency and generalizability of the representations. To this end, we propose a “hierarchical controllable diffusion model” (HCDM) for learning 3D models, represented as neural radiance fields or point clouds, with a latent diffusion model. HCDM improves the quality of 3D modeling by representing 3D models as functional parameters, which are much smaller than the 3D models themselves and therefore yield an efficient representation. HCDM learns the distribution of these parameters, improving generalizability across data from multiple modalities. The performance of the proposed method is further evaluated on 2D images and 3D motion data.
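As a rough sketch of diffusion over functional parameters, the toy DDPM-style training step below treats a flattened weight vector (e.g., a small NeRF MLP's parameters) as the data to be denoised; the denoiser, dimensions, and noise schedule are illustrative assumptions and omit HCDM's hierarchy and controllability:

```python
import torch
import torch.nn as nn

class ParamDenoiser(nn.Module):
    """Tiny denoiser over flattened 3D-model parameters, a hypothetical
    stand-in for a latent diffusion model on functional parameters."""

    def __init__(self, param_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(param_dim + 1, 512), nn.SiLU(), nn.Linear(512, param_dim)
        )

    def forward(self, noisy_params, t):
        # t: (B, 1) diffusion timestep in [0, 1]
        return self.net(torch.cat([noisy_params, t], dim=-1))

def diffusion_step(params, denoiser, alpha_bar):
    """One DDPM-style training step on a batch of parameter vectors."""
    noise = torch.randn_like(params)
    t = torch.rand(params.shape[0], 1)
    a = alpha_bar(t)  # cumulative noise schedule, decreasing in t
    noisy = a.sqrt() * params + (1 - a).sqrt() * noise
    return nn.functional.mse_loss(denoiser(noisy, t), noise)

# Usage: 8 "models", each a 256-dim flattened parameter vector.
denoiser = ParamDenoiser()
loss = diffusion_step(torch.randn(8, 256), denoiser, lambda t: torch.cos(t * 1.4) ** 2)
```

The point of the sketch is the data representation: the diffusion model never sees the rendered 3D content, only compact parameter vectors, which is what makes the representation small and modality-agnostic.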

In summary, this thesis aims to learn both effective and efficient representations to enhance 3D understanding and 3D modeling quality. To improve 3D understanding, it investigates perceptual relationships in both single-view and multi-view scenarios. Human perception representations are then applied to enhance the quality of 3D modeling. Finally, functional representations are employed to improve not only the quality but also the efficiency and generalizability of 3D models.