Location
MC 5501
Speaker
Francesco Cagnetta, Theoretical and Scientific Data Science Group at SISSA
Title
From Data Statistics to Scaling Laws: Toward a Physics of Representation Learning
Abstract
The successes of modern learning systems largely stem from their ability to learn representations: coarse-grained descriptions of data that retain predictive information while discarding irrelevant microscopic details. Approximation theory helps explain why deep architectures can represent such structure efficiently, while mechanistic interpretability has begun to reveal what these systems encode in practice. Yet we still lack a predictive theoretical framework---a “physics” of representation learning---that explains how useful representations emerge during training and how they depend on the statistical structure of the data.
In this talk, I will describe a model-based approach toward such a framework, inspired by the physics of complex systems: isolate robust structural properties of real data in analytically controlled settings, derive quantitative predictions, and test them in realistic machine-learning scenarios. As a main example, I will show how this perspective leads to a predictive theory of Neural Scaling Laws: the ubiquitous power-law relationships between a machine-learning model's performance and its training resources, such as dataset size. In particular, I will argue that the scaling exponents of modern transformer-based language models trained on real text corpora can be derived from measurable statistical properties of language. By linking learning curves to the statistical structure of the data, this approach turns scaling into a quantitative probe, and a concrete target, for a physics of representation learning. More generally, it suggests a route toward similarly predictive descriptions of other emergent properties of modern learning systems.
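For readers unfamiliar with the term, neural scaling laws are commonly summarized in the empirical literature by a schematic power law; the parametrization below is a standard illustrative form, not necessarily the one derived in the talk:

$$\mathcal{L}(N) \approx \mathcal{L}_\infty + A\,N^{-\alpha},$$

where $\mathcal{L}(N)$ is the test loss after training on $N$ examples, $\mathcal{L}_\infty$ is an irreducible loss floor, $A$ is a constant, and $\alpha > 0$ is the scaling exponent. On a log-log plot, the reducible loss $\mathcal{L}(N) - \mathcal{L}_\infty$ falls on a straight line of slope $-\alpha$, which is how such exponents are measured in practice and what a predictive theory of the kind described here aims to compute from data statistics.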