Please note: This master’s thesis presentation will take place online.
Ali Falahati, Master’s candidate
David R. Cheriton School of Computer Science
Supervisor: Professor Lukasz Golab
Reinforcement Learning from Human Feedback (RLHF) is widely used to align generative models with human preferences. However, most work studies alignment as a one-time procedure applied to a fixed dataset. In practice, training data is dynamic: over time, generative models begin to train on curated outputs produced by earlier generations, creating a feedback loop of recursive retraining. In this setting, alignment is a dynamic process in which curation decisions compound over time and continually shape the support, diversity, and alignment profile of future models. This thesis develops a framework for studying how alignment evolves under recursive retraining, focusing on how heterogeneous preferences interact through the Bradley-Terry-style pairwise comparison mechanisms used in curation.
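The curation mechanism can be stated precisely. In the standard Bradley-Terry formulation, with a scalar reward function r(·) over outputs (the notation here is the textbook form, not taken from the thesis), an output x wins a pairwise comparison against an output y with probability

\[
P(x \succ y) = \frac{\exp(r(x))}{\exp(r(x)) + \exp(r(y))}.
\]

Curating data by pairwise comparison thus amounts to repeatedly sampling winners from this distribution, and heterogeneous preferences correspond to different reward functions r.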
The thesis studies two settings of recursive curation. We begin by revisiting prior work on single-preference curation, which shows that repeatedly optimizing for a fixed preference degrades quality, reduces diversity, and collapses the model toward a narrow subset of outputs. These findings have raised concerns that recursive training loops inevitably reinforce a single dominant preference over time.
In the first setting, moving beyond prior work, multiple preferences jointly curate the data at each retraining step, so the training data reflects a mixture of competing preferences rather than reinforcing a single one. We show that recursive retraining in this setting can maintain a range of desirable behaviors rather than collapsing, and that the resulting models settle into a stable balance among the competing preferences.
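To make the recursive loop concrete, the following toy sketch (illustrative code, not from the thesis; all names and settings are assumptions) treats the model as a distribution over ten discrete outputs, curates pairwise winners under a mixture of two opposing Bradley-Terry preferences, and refits the model on the winners at each generation:

import math
import random

# Toy recursive-curation loop: the "model" is a distribution over discrete
# outputs 0..K-1. Each generation, sampled pairs are compared, the winner
# (under a Bradley-Terry preference) is kept, and the next model is fit to
# the empirical distribution of winners.
K = 10
REWARDS = [
    [k / (K - 1) for k in range(K)],      # preference A favors large outputs
    [1 - k / (K - 1) for k in range(K)],  # preference B favors small outputs
]

def bt_winner(x, y, reward):
    """Sample the winner of one pairwise comparison under Bradley-Terry."""
    p_x = math.exp(reward[x]) / (math.exp(reward[x]) + math.exp(reward[y]))
    return x if random.random() < p_x else y

def generation_step(dist, n_pairs=5000, mix=0.5):
    """One retraining round: sample pairs from the current model, curate each
    pair with preference A (probability mix) or B, refit on the winners."""
    counts = [0] * K
    for _ in range(n_pairs):
        x, y = random.choices(range(K), weights=dist, k=2)
        reward = REWARDS[0] if random.random() < mix else REWARDS[1]
        counts[bt_winner(x, y, reward)] += 1
    total = sum(counts)
    return [c / total for c in counts]

dist = [1 / K] * K          # generation 0: uniform model
for _ in range(20):
    dist = generation_step(dist)
print([round(p, 3) for p in dist])

With mix=1.0 the loop reproduces single-preference collapse toward one end of the output space; with mix=0.5 the two symmetric preferences offset each other and the distribution stays spread across the range, reflecting a stable balance between them.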
In the second setting, the thesis analyzes recursive retraining with sequential curation by different stakeholders, which mirrors how alignment is applied in practice: model outputs are not curated according to a single preference but in stages by different actors, such as model developers and end users, each with their own preferences. This raises a fundamental question: when different preferences curate sequentially across generations, how do the order and structure of curation shape the long-term behavior of the model?
We show that the order in which preferences are applied plays a critical role: recursive retraining can lead to consensus collapse, compromise on shared outcomes, or asymmetric influence in which one stakeholder’s preferences dominate over time. These dynamics highlight that alignment is determined not only by which preferences are present, but also by how they are introduced and reinforced across generations.
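Continuing the toy sketch above, a hypothetical two-stage variant makes the order question concrete: each generation, a pair is first curated under one stakeholder’s preference, and the survivor is then compared against a fresh sample under the other’s. It reuses K, REWARDS, and bt_winner from the earlier sketch, and the staging scheme is an illustration, not the thesis’s exact mechanism:

def sequential_step(dist, n_pairs=5000, order=(0, 1)):
    """Stage one filters a pair under the first preference; stage two
    re-compares the survivor against a fresh sample under the second."""
    counts = [0] * K
    for _ in range(n_pairs):
        x, y = random.choices(range(K), weights=dist, k=2)
        survivor = bt_winner(x, y, REWARDS[order[0]])   # e.g. developer curation
        z = random.choices(range(K), weights=dist, k=1)[0]
        counts[bt_winner(survivor, z, REWARDS[order[1]])] += 1  # e.g. end-user curation
    total = sum(counts)
    return [c / total for c in counts]

Running the loop with order=(0, 1) versus order=(1, 0) tilts the long-run distribution toward a different preference’s favored outputs, a small-scale analogue of the asymmetric influence described above.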
Overall, we show that the long-term behavior of aligned generative models is not fixed, but depends on the structure of the retraining process. Alignment should therefore be understood as a mechanism design problem, where the way preferences are aggregated determines whether models collapse, compromise, or remain pluralistic.