Please note: This master’s thesis presentation will take place online.
Zhiheng Lyu, Master’s candidate
David R. Cheriton School of Computer Science
Supervisor: Professor Wenhu Chen
Large language models (LLMs) have demonstrated remarkable capabilities in code understanding and generation, yet a significant gap remains between static code generation and interactive software engineering. This thesis investigates the post-training of LLMs as software engineering agents, focusing on three interconnected challenges: infrastructure, data, and training methodology.
First, we contribute to VerlTool, a unified framework for agentic reinforcement learning with tool integration (ARLT); our contributions center on the training orchestration layer — the stateful environment protocol, the environment server architecture, and the SWE agent post-training pipeline — which makes tool-augmented RL training practical and accessible for researchers.

Second, we address the critical bottleneck of training data and evaluation infrastructure. SWE-Next provides a scalable, Ray-native pipeline for synthesizing verifiable software engineering tasks from open-source repositories (ongoing work, with intermediate results reported), while for SWE-QA-Pro, a representative benchmark for code question answering, we contribute the data sourcing and synthesis pipeline.

Third, we investigate the post-training design space for software engineering agents, spanning supervised fine-tuning (SFT), rejection fine-tuning (RFT), reinforcement learning from AI feedback (RLAIF), and reinforcement learning with verifiable rewards (RLVR). Through three complementary case studies — code question answering (SFT + RLAIF), web-based information retrieval (SFT + RFT), and repository-level bug fixing (RLVR) — we demonstrate that the optimal training recipe depends on task characteristics such as reward verifiability, exploration complexity, and data availability. Our experiments show that task-specific post-training of smaller open-weight models can be competitive with larger proprietary models, and that matching the training method to the task structure matters more than uniformly applying every stage.
Attend this master’s thesis presentation virtually on Google Meet.