Project 19 - Why Sparse Regression Works? | Women in Computer Science

Graduate mentor's supervisor: Prof. Marina Meila

Suppose we want to predict an outcome with hundreds or even thousands of possible measurements. In many problems, only a small number of those measurements may actually be useful. For example, only a few genes may be important for predicting a biological response, or only a few features may be needed to describe a complex dataset.

Sparse regression is a group of machine-learning methods that tries to make accurate predictions while using only a small number of variables. One widely used example is the LASSO. These methods are popular, because the resulting models can be easier to understand, faster to use, and less likely to depend on irrelevant information.

However, sparse regression can behave in surprising ways. If two variables contain very similar information, the method may choose (only) one of them almost arbitrarily. Small changes in the data may cause it to select a very different set of variables. Noise may also make an unimportant variable appear useful. Although there are mathematical results explaining when sparse regression should work, their assumptions can be difficult to check and may not hold for real datasets.

So, this project wants to answer: Why does sparse regression often work well in practice, even when the usual theoretical assumptions do not clearly apply?

We will study this question using ideas from geometry, statistics, and optimization. Here, geometry means thinking about variables as directions or points in space. For example, two variables that contain almost the same information can be viewed as pointing in nearly the same direction. This viewpoint may help us understand when sparse regression makes reliable predictions, when it selects meaningful variables, and when its answer is unstable.

We may first study simpler cases in which the variables have some known structure. For example:

variables may come in natural groups;
variables may be connected through a network or graph;
several variables may describe the same underlying source of information; or
some variables may have known relationships suggested by the application.

By understanding these structured cases first, we hope to identify conditions under which sparse regression works reliably. We may then try to weaken those conditions and develop explanations that apply more broadly.

Students will work in a team of 3–4 on related but complementary tasks.

Short-term goals may include:

learning the basic ideas behind linear regression, sparse regression, and the LASSO;
implementing sparse-regression methods on small datasets;
creating examples with noise or highly related variables;
testing how the selected variables change when the data are slightly modified;
visualizing simple examples to understand the geometry of the problem; and
comparing what happens in experiments with what existing theory predicts.

Medium-term goals may include:

studying cases where variables form groups, networks, and causal structures;
identifying simple conditions that make sparse regression more reliable;
studying the difference between accurate prediction and correct variable selection; and
proving mathematical results for carefully chosen special cases.

Longer-term directions may include:

weakening the assumptions needed for the mathematical guarantees;
extending the results to more general sparse-learning methods;
designing methods that make use of known groups or relationships between variables;
studying how to make sparse regression more stable; and
developing a clearer explanation of why these methods perform well on real data.

You will gain experience with machine learning, mathematical reasoning, optimization, computational experiments, and research collaboration. This project is especially suitable for students who enjoy moving between theory and computation and asking not only whether an algorithm works, but also why.

This project is intended for undergraduate students who are interested in mathematics, algorithms, or machine learning and who would like to understand why an algorithm works.

Students should have:

experience with programming in Python or another language with or without the help of LLMs;
Good knowledge of linear algebra, probability, statistics, calculus/mathematical analysis and
willingness to play with mathematical arguments and proofs.

Working knowledge in optimization, or machine learning would be desired, but is not strictly required. Previous research experience is not expected. The necessary ideas about regression, sparsity, and optimization will be introduced during the project.

You do not need to begin as specialists in theory. Team members may contribute in different ways: some may focus on experiments and programming, while others may focus on mathematical examples, proofs, or visualizations.