Please note: This master’s thesis presentation will take place online.
Max Ku, Master’s candidate
David R. Cheriton School of Computer Science
Supervisor: Professor Wenhu Chen
The rapid development of generative models has ushered in a new era in AI, especially in conditional image synthesis. Since the rise of diffusion models, current models can perform image generation with high fidelity and diversity. This thesis focuses on controllable generation and manipulation in the image and video domains, guided by three studies: ImagenHub, which identifies the controllability of current state-of-the-art image synthesis models; VIEScore, which produces explainable metrics for image synthesis tasks; and AnyV2V, which performs precise video editing.
The first part of this thesis addresses evaluation in the image domain. ImagenHub tackles the challenge of comparing current research to identify the best-performing methods, and it standardizes human-centered evaluation in image synthesis research. Complementing it, VIEScore is a new explainable metric that mimics human-like evaluation across conditional image synthesis tasks using multimodal LLMs, tackling the scalability issue of ImagenHub.
The second part focuses on the video domain and introduces AnyV2V, the first framework to treat video editing as an image editing problem. It leverages the editing capabilities of off-the-shelf image editing models and the generalization power of image-to-video models to perform precise video editing. This paradigm is training-free and supports video edits across a wide range of applications. Most importantly, we report improved performance when AnyV2V is paired with stronger image-to-video models, highlighting its capacity for adaptive evolution.
These studies form the basis of this thesis, driving toward robust control in visual generation and manipulation. Through a thorough analysis with ImagenHub and VIEScore, this research not only identifies the current capabilities and limitations of image synthesis models but also sets the stage for future advancements in evaluating them. With AnyV2V, we bridge image editing and video editing through image-to-video models, laying the groundwork for making video editing more controllable and robust.