Graduate mentor's supervisor: Prof. Marina Meila
Many datasets contain hundreds or thousands of variables, even though only a small number may be useful for prediction. Sparse regression methods, such as the LASSO, try to build accurate models while selecting only a few important variables. This can make models simpler and easier to interpret.
However, these methods can behave in surprising ways. Noise may make unimportant variables appear useful, and variables containing similar information may be selected inconsistently. Existing mathematical guarantees often rely on assumptions that are difficult to verify or unrealistic for real data.
So, we want to know
Why does sparse regression often work well in practice, even when standard theory does not clearly apply?
We will explore this question using geometry, statistics, optimization, and computational experiments. We may first study simpler settings where variables form groups, are connected through a network, or have other known relationships. These cases may help us understand when sparse regression is accurate, stable, and reliable, and eventually lead to more general results.
This project is part of the Reliable Structure Discovery program, aiming to understand large scientific datasets through sparse non-linear interpretations.
Students should have:
- experience with programming in Python or another language with or without the help of LLMs;
- Good knowledge of linear algebra, probability, statistics, calculus/mathematical analysis and
- willingness to play with mathematical arguments and proofs.
Courses in probability, statistics, optimization, or machine learning would be helpful but are not required. Previous research experience is not expected. Students may focus more on programming and experiments or on mathematical examples and proofs, depending on their interests.
Students will work in a team of 3–4. During the term, they may:
- learn and implement basic sparse-regression methods;
- create datasets with noise or strongly related variables;
- test how small changes in the data affect the selected variables;
- visualize simple examples and compare experiments with theory; and
- study structured cases involving groups or networks of variables.
Students who continue with the project may help identify conditions that make sparse regression reliable, prove results for special cases, weaken existing assumptions, or design more stable methods. The new methods developed will be applied to problems from astronomy, chemistry, material science, computational biology, or history.