Candidate: Akash Karthikeyan
Date: August 6, 2025
Time: 11:30am
Location: Zoom (link: https://uwaterloo.zoom.us/j/4123583238?pwd=ZU1rUXN6dGhLL0NuS2I4d01rbCswQT09)
Supervisor: Dr. Yash Vardhan Pant
All are welcome!
Abstract:
Generative models have achieved remarkable progress across domains such as vision and language. However, their application to sequential decision-making and planning remains challenging. In reinforcement learning and robotics, agents must operate under long-horizon dependencies, adapt to new tasks and environments, and especially in multi-agent settings respond to adversarial or evolving opponents. Despite the progress in behavioral cloning and offline policy learning, existing approaches often fail to generalize beyond the demonstration distribution or learn robust, interactive behaviors in competitive games. These limitations restrict current systems to narrow tasks, short temporal horizons, and deterministic settings. For instance, behavioral planners trained on single-goal environments struggle with multi-task missions requiring subgoal discovery and adaptive reasoning, as there is no straightforward mechanism for iterative test-time improvement. Similarly, in multi-agent reinforcement learning, standard policy optimization often yields unimodal, brittle strategies that overfit to specific opponents and fail to converge to equilibrium in continuous games. This thesis explores challenges and opportunities in using generative models for planning and decision-making tasks, specifically energy-based and diffusion-based models which serve as both representations and solvers for planning and policy learning. In the single-agent setting, we introduce GenPlan, a discrete-flow planner that reframes planning as iterative denoising over trajectories using an energy-guided diffusion process. This formulation enables task and goal discovery, and generalization to unseen environments. In the multi-agent setting, we propose DiffFSP, a diffusion policy gradient method within the fictitious self-play framework. 
By learning best responses through diffusion models, DiffFSP captures multimodal strategies, improves convergence speed, and remains robust to evolving opponents in zero-sum continuous games. Our empirical studies show that GenPlan outperforms baselines by over 10% on adaptive planning tasks, generalizing from single-task demonstrations to complex, compositional multi-task missions. Likewise, DiffFSP achieves up to 3 times faster convergence and 30 times higher success rates compared to traditional RL in multi-agent benchmarks. These results demonstrate the potential of generative modeling not only for representation learning, but as a unified substrate for planning, learning, and decision-making across settings.
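For readers unfamiliar with the "planning as iterative denoising with energy guidance" idea, here is a minimal toy sketch: a trajectory is initialized as noise and repeatedly nudged down the gradient of a hand-made energy (goal distance plus smoothness) while injected noise is annealed to zero. The energy, schedule, and all parameters below are illustrative assumptions, not GenPlan's actual discrete-flow model.

```python
import numpy as np

def energy(traj, goal):
    # Hypothetical energy: squared distance of the final state to the goal,
    # plus a smoothness penalty on consecutive states.
    return np.sum((traj[-1] - goal) ** 2) + 0.1 * np.sum(np.diff(traj, axis=0) ** 2)

def energy_grad(traj, goal):
    # Analytic gradient of the energy above with respect to the trajectory.
    grad = np.zeros_like(traj)
    grad[-1] += 2.0 * (traj[-1] - goal)   # goal-reaching term
    d = np.diff(traj, axis=0)             # smoothness term
    grad[:-1] -= 0.2 * d
    grad[1:] += 0.2 * d
    return grad

def denoise_plan(goal, horizon=8, dim=2, steps=100, guidance=0.4, seed=0):
    rng = np.random.default_rng(seed)
    traj = rng.normal(size=(horizon, dim))  # start from pure noise
    for t in range(steps):
        # Energy-guided update, then noise whose scale anneals to zero.
        traj = traj - guidance * energy_grad(traj, goal)
        noise_scale = 1.0 - (t + 1) / steps
        traj = traj + 0.05 * noise_scale * rng.normal(size=traj.shape)
    return traj

goal = np.array([1.0, -1.0])
plan = denoise_plan(goal)  # trajectory whose final state ends up near the goal
```

Repeating the denoising loop at test time is what gives this family of planners a natural mechanism for iterative test-time improvement.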
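The diffusion best-response learner in DiffFSP is beyond a short sketch, but the fictitious self-play scaffold it plugs into can be illustrated with classical fictitious play on matching pennies, a toy zero-sum matrix game: each player best-responds to the opponent's empirical average strategy, and the averages converge to the Nash equilibrium. Everything below is a textbook illustration, not the thesis's implementation (DiffFSP would replace `best_response` with a learned diffusion policy).

```python
import numpy as np

# Payoff matrix for player 1 in matching pennies (zero-sum: P2 gets the negative).
A = np.array([[1.0, -1.0],
              [-1.0, 1.0]])

def best_response(payoffs):
    # Pure best response: all probability mass on the highest-payoff action.
    br = np.zeros_like(payoffs)
    br[np.argmax(payoffs)] = 1.0
    return br

def fictitious_play(A, iters=5000):
    avg1 = np.array([1.0, 0.0])  # empirical average strategy of player 1
    avg2 = np.array([1.0, 0.0])  # empirical average strategy of player 2
    for t in range(1, iters + 1):
        br1 = best_response(A @ avg2)       # P1 best-responds to P2's average
        br2 = best_response(-(A.T @ avg1))  # P2 best-responds (zero-sum payoffs)
        # Running averages over the best responses played so far.
        avg1 = avg1 + (br1 - avg1) / (t + 1)
        avg2 = avg2 + (br2 - avg2) / (t + 1)
    return avg1, avg2

s1, s2 = fictitious_play(A)
# The averages approach the uniform Nash equilibrium (0.5, 0.5) for both players.
```

The pure argmax best response is exactly where brittleness and unimodality enter for standard methods; learning the best response with a generative model is what lets the abstract's approach represent multimodal strategies.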