MASc Seminar Notice: "Scene Representations for Generalizable Novel View Synthesis" by Youssef Fathi

Thursday, March 23, 2023, 9:00 am - 10:00 am EDT (GMT -04:00)

Name: Youssef Fathi

Date: Thursday, March 23, 2023

Time: 9:00 am - 10:00 am (EDT)

Location: online

Supervisor: Prof. Fakhri Karray

Title: Scene Representations for Generalizable Novel View Synthesis

Abstract:

Novel view synthesis is the task of generating new views of a scene from previously unseen viewpoints. It has numerous applications in computer vision, such as telepresence, virtual reality, and re-cinematography. Recent work in the field has achieved remarkable photo-realistic synthesis results; however, these methods require per-scene optimization and densely sampled input views, which are not easily attainable in practice. Developing lightweight, generalizable view synthesis systems that operate from sparse input views would make them far more suitable for direct consumer use. Novel view synthesis poses several difficulties, such as handling occluded regions, broadening the range of viewing directions, sidestepping costly per-scene optimization, and depicting intricate multi-human scenarios. Tackling these challenges depends on the representation used to model the 3D structure of the scene.

Explicit 3D representations model the scene structure directly. One example is multi-plane images (MPIs), which segment the scene into a set of parallel planes, allowing occlusions to be handled effectively. Implicit neural representations, such as Neural Radiance Fields (NeRF), encode the 3D scene structure within the weights of a neural network, enabling a 360-degree range of viewing directions and photo-realistic synthesis results. A promising avenue of research is to combine implicit and explicit representations in order to harness their respective advantages and address more challenging scenarios.

In this thesis, we focus on layered scene representations that blend explicit and implicit properties at either the pixel or the object level in a generalizable manner. One example of a pixel-level representation is Multi-plane Neural Radiance Fields (MINE), which combines multi-plane images with Neural Radiance Fields for efficient and generalizable novel view synthesis. However, the current literature examines MINE only in single-view settings, which limits its viewing range. Our work conducts a thorough technical analysis of the capabilities of single-view MINE and proposes a new multi-plane NeRF architecture that accepts multiple input views to improve synthesis quality and expand the viewing range. Additionally, existing methods for handling complex multi-human scenes rely on per-scene optimization, making them impractical for real-world use. To address this, we propose a novel object-level layered scene representation named GenLayNeRF that can generate novel views of scenes with close human interactions while generalizing to new human subjects and poses. Furthermore, open-source datasets for multi-human view synthesis are scarce. To fill this gap, we create two new datasets, ZJU-MultiHuman and DeepMultiSyn, which contain scenes with close human interactions; these datasets are used to evaluate our approach against generalizable and per-scene baselines. The results indicate that our proposed approach outperforms generalizable and non-human per-scene NeRF methods while performing on par with layered per-scene methods, without requiring test-time optimization.
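
To make the layered representation discussed above more concrete, the following is a minimal, hypothetical sketch (plain NumPy, not the thesis code) of how an explicit multi-plane image is composited into an output view; the function name and array shapes are illustrative assumptions. An implicit NeRF-style model would instead predict colors and densities with a neural network and integrate them along each camera ray in the same front-to-back fashion, which is the property that MINE-style hybrids exploit.

```python
import numpy as np

def composite_mpi(rgb_planes: np.ndarray, alpha_planes: np.ndarray) -> np.ndarray:
    """Front-to-back "over" compositing of an MPI (illustrative sketch only).

    rgb_planes:   (D, H, W, 3) per-plane colors in [0, 1], ordered nearest to farthest
    alpha_planes: (D, H, W, 1) per-plane opacities in [0, 1]
    returns:      (H, W, 3) rendered image
    """
    image = np.zeros(rgb_planes.shape[1:])           # accumulated color (H, W, 3)
    transmittance = np.ones(alpha_planes.shape[1:])  # fraction of light not yet blocked (H, W, 1)
    for rgb, alpha in zip(rgb_planes, alpha_planes):
        image += transmittance * alpha * rgb         # nearer planes occlude farther ones
        transmittance *= (1.0 - alpha)
    return image

if __name__ == "__main__":
    # Toy usage with random planes, just to show the shapes involved.
    D, H, W = 8, 64, 64
    rgb = np.random.rand(D, H, W, 3)
    alpha = np.random.rand(D, H, W, 1)
    print(composite_mpi(rgb, alpha).shape)  # (64, 64, 3)
```

In practice the planes would first be warped into the target camera (e.g., via per-plane homographies) before compositing; that step is omitted here for brevity.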