Project 19 - Finding the Hidden Shape of Complex Data | Women in Computer Science

Graduate mentor's supervisor: Prof. Marina Meila

Real-world datasets can be extremely high-dimensional: images contain many pixels, speech recordings contain many measurements, and scientific datasets may include thousands of variables. However, the important information may depend on only a small number of underlying factors.

This idea is often described by saying that data lie near a low-dimensional manifold—informally, a simpler curved shape hidden inside a much larger space. Discovering this structure could help us visualize data, remove noise, compare examples, and design more reliable machine-learning methods.

This project asks:

When can we learn the hidden shape of data accurately, efficiently, and reliably?

Learning the full manifold can require a large amount of data and computation, especially when observations are noisy. Instead, we may study whether it is enough to preserve only certain useful properties, such as nearby relationships, clusters, smoothness, dimension, or information needed for prediction. Learning less than the full structure may lead to faster and more robust methods.

This project is part of the Reliable Structure Discovery program, which aims to understand large scientific datasets through low-dimensional geometric estimation.

Students should have:

some programming experience, preferably in Python or a similar language (with or without the help of LLMs);
a solid foundation in linear algebra, multivariable calculus/analysis, and probability; and
a willingness to explore mathematical ideas and proofs and to test them through experiments.

Experience with statistics, machine learning, optimization, or geometry, or probability beyond the introductory level, would be desirable. Previous knowledge of differentiable manifolds and/or differential geometry and previous research experience are not expected, but either would be a plus.

The graduate mentor will learn and discuss the topic alongside the undergraduate students. The project will be a collaborative learning environment in which the team reads, experiments, asks questions, and develops ideas together.

Students (and the graduate mentor) will work in a team of 3–5. During the term, they may:

learn and implement basic dimension-reduction and manifold-learning methods;
create and visualize datasets with simple hidden shapes;
test how noise, sample size, and data dimension affect the methods;
compare which properties different algorithms preserve or lose; and
measure both accuracy and running time.
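
As a rough illustration of what a first experiment along these lines might look like, the sketch below samples a circle (a one-dimensional hidden shape), embeds it in a 50-dimensional space, adds noise, and reduces it back to two dimensions with PCA. It then reports how many true nearest neighbors survive the reduction, together with the running time. The choice of a circle, PCA, and the neighbor-overlap score, as well as all function names and parameter values, are assumptions made for this example and not methods prescribed by the project.

```python
import time

import numpy as np


def make_noisy_circle(n_samples, ambient_dim, noise_std, rng):
    """Sample a unit circle, embed it in ambient_dim dimensions, add Gaussian noise."""
    angles = rng.uniform(0.0, 2.0 * np.pi, size=n_samples)
    circle = np.column_stack([np.cos(angles), np.sin(angles)])  # shape (n, 2)
    # Random orthonormal pair of directions in the ambient space (hypothetical setup).
    frame, _ = np.linalg.qr(rng.standard_normal((ambient_dim, 2)))
    noisy = circle @ frame.T + noise_std * rng.standard_normal((n_samples, ambient_dim))
    return noisy, circle


def pca_embed(data, n_components):
    """Project centered data onto its top principal directions via the SVD."""
    centered = data - data.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


def knn_indices(points, k):
    """Indices of the k nearest neighbors of every point (brute force)."""
    sq = np.sum(points**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * points @ points.T
    np.fill_diagonal(d2, np.inf)  # a point is not its own neighbor
    return np.argsort(d2, axis=1)[:, :k]


def neighbor_overlap(reference, embedded, k):
    """Average fraction of each point's reference neighbors kept in the embedding."""
    ref = knn_indices(reference, k)
    emb = knn_indices(embedded, k)
    return float(np.mean([len(set(r) & set(e)) / k for r, e in zip(ref, emb)]))


def run_experiment(noise_levels=(0.0, 0.05, 0.5), n_samples=300,
                   ambient_dim=50, k=10, seed=0):
    """Return (noise_std, neighbor_overlap, seconds) for each noise level."""
    rng = np.random.default_rng(seed)
    results = []
    for noise_std in noise_levels:
        noisy, circle = make_noisy_circle(n_samples, ambient_dim, noise_std, rng)
        start = time.perf_counter()
        embedded = pca_embed(noisy, 2)
        elapsed = time.perf_counter() - start
        results.append((noise_std, neighbor_overlap(circle, embedded, k), elapsed))
    return results


if __name__ == "__main__":
    for noise_std, overlap, seconds in run_experiment():
        print(f"noise={noise_std:.2f}  overlap={overlap:.2f}  time={seconds:.4f}s")
```

Without noise, PCA recovers the plane containing the circle, so the nearest neighbors are preserved; as the noise grows, the overlap score should drop. Varying `n_samples` and `ambient_dim` in the same loop is one way to probe the effect of sample size and data dimension.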

Students who continue with the project may study how to estimate dimension in noisy data, identify cases where existing methods work or fail, design faster or more robust algorithms, or prove guarantees for structured settings. The new methods developed will be applied to problems from astronomy, chemistry, materials science, computational biology, or history.
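
To give a flavor of why estimating dimension in noisy data is delicate, the sketch below applies a nearest-neighbor maximum-likelihood estimator in the style of Levina and Bickel to a flat two-dimensional patch embedded in ten dimensions, once without noise and once with noise. The estimator, the flat test shape, and the names `sample_noisy_plane` and `mle_intrinsic_dimension` are assumptions chosen for illustration; the project itself may use different estimators and datasets.

```python
import numpy as np


def sample_noisy_plane(n_samples, ambient_dim, noise_std, rng):
    """Uniform points on a flat 2-D square, embedded in ambient_dim dims, plus noise."""
    coords = rng.uniform(0.0, 1.0, size=(n_samples, 2))
    # Random orthonormal pair of directions in the ambient space (hypothetical setup).
    frame, _ = np.linalg.qr(rng.standard_normal((ambient_dim, 2)))
    noise = noise_std * rng.standard_normal((n_samples, ambient_dim))
    return coords @ frame.T + noise


def mle_intrinsic_dimension(points, k=10):
    """Average over all points of a k-nearest-neighbor MLE of the local dimension."""
    sq = np.sum(points**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * points @ points.T
    np.fill_diagonal(d2, np.inf)  # exclude each point from its own neighbors
    # Sorted distances to the k nearest neighbors of each point, shape (n, k).
    knn_dist = np.sqrt(np.maximum(np.sort(d2, axis=1)[:, :k], 1e-24))
    # Log-ratio of the k-th neighbor distance to each closer neighbor distance.
    log_ratios = np.log(knn_dist[:, -1:] / knn_dist[:, :-1])  # shape (n, k - 1)
    local_estimates = 1.0 / log_ratios.mean(axis=1)
    return float(local_estimates.mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    for noise_std in (0.0, 0.1):
        data = sample_noisy_plane(1000, 10, noise_std, rng)
        print(f"noise={noise_std:.1f}  estimated dimension="
              f"{mle_intrinsic_dimension(data):.2f}")
```

On the noiseless patch the estimate should land near 2, while the noisy version looks higher-dimensional because, at the scale of nearest-neighbor distances, the noise fills out the surrounding space.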