Project 20 - Finding the Hidden Shape of Complex Data

Graduate Mentor: Hanzhe Wu

Graduate mentor's supervisor: Prof. Marina Meila

Many real-world datasets look extremely complicated. An image may contain millions of pixels, a speech recording may contain thousands of measurements, and a scientific experiment may produce a very large number of variables. However, the most important information in these datasets may depend on only a much smaller number of underlying factors.

For example, pictures of the same object may differ mainly because of its position, angle, lighting, or size. Although each picture contains many pixels, the collection of all such pictures may have a much simpler hidden shape. In machine learning, this idea is often described by saying that data lie near a low-dimensional manifold. A manifold can be thought of informally as a curved surface or shape hidden inside a much larger space.

Understanding this hidden structure could help us visualize data, remove noise, compare examples, and build machine-learning systems that are faster, more reliable, and easier to understand.

In this project, we will try to answer: When can we discover the hidden shape of data accurately and efficiently?

This is a difficult problem. In the most general setting, learning the full shape may require a very large amount of data and computation. Real data are also noisy, so observations may not lie exactly on a clean surface. Even deciding how many underlying dimensions the data have can be challenging.

Instead of trying to recover every detail, we may ask whether it is enough to preserve only the information needed for a particular purpose. For example, can we preserve:

  • which data points are close to one another;
  • the main directions in which the data vary;
  • the clusters and/or connected regions;
  • smooth changes between nearby examples;
  • the approximate dimension of the data; or
  • the information needed for classification or prediction?

Learning only part of the structure may lead to methods that are faster, need fewer data points, and are less sensitive to noise. We may also investigate when dimension reduction would remove useful information and how to avoid losing important features of the original data.

Students (and the graduate mentor) will work in a team of 3–5. Different team members may focus on programming, experiments, visualization, mathematical examples, or theoretical analysis.

Short-term goals may include:

  • learning the basic ideas behind dimension reduction and manifold learning;
  • creating simple datasets with known shapes, such as curves or surfaces;
  • implementing and comparing existing methods;
  • visualizing what different algorithms discover;
  • testing how the methods change when noise is added;
  • measuring their accuracy and running time; and
  • discussing research papers or introductory materials as a team.

Medium-term goals may include:

  • identifying types of data on which particular methods work well;
  • studying how much data an algorithm needs;
  • investigating how noise affects estimates of shape or dimension;
  • comparing methods that recover a full manifold with methods that preserve only selected properties;
  • finding examples where common methods fail; and
  • developing small improvements to make an algorithm faster or more reliable.

Longer-term directions may include:

  • designing algorithms that learn only the information needed for a specific task;
  • developing more robust methods for noisy data;
  • providing mathematical explanations or guarantees for when a method works;
  • preserving important properties such as smoothness or neighborhood relationships;
  • studying geometric structure directly in the original high-dimensional space; and
  • applying the methods to more realistic scientific or machine-learning datasets.

Students will gain experience in machine learning, algorithms, scientific programming, visualization, mathematical reasoning, reading research papers, and working as part of a research team. The project is especially suitable for students who enjoy exploring open questions and learning new ideas through a combination of discussion, theory, and computation.

This project is intended for undergraduate students who are interested in machine learning, mathematics, algorithms, or data visualization.

Students should have:

  • some programming experience, preferably in Python or a similar language, with or without LLMs;
  • solid foundation of linear algebra, multivariable calculus/analysis, and probability
  • a willingness to explore mathematical ideas and proofs and test them through experiments.

Experience with probability, statistics, machine learning, optimization, or geometry would be desirable. Previous knowledge of differentiable manifolds and differential geometry and previous research experience are not expected, but a good plus.

You will not be expected to understand all of the theory at the beginning. The graduate mentor will also learn and discuss the topic together with the undergraduate team. The project will therefore be organized as a collaborative learning experience in which students and the mentor read, experiment, ask questions, and develop ideas together, under the guidance of the advisor. Team members will be encouraged to explain concepts to one another and contribute according to their different interests and strengths.