Projects - search

Understanding this hidden structure could help us visualize data, remove noise, compare examples, and build machine-learning systems that are faster, more reliable, and easier to understand.

In this project, we will try to answer: When can we discover the hidden shape of data accurately and efficiently?

This is a difficult problem. In the most general setting, learning the full shape may require a very large amount of data and computation. Real data are also noisy, so observations may not lie exactly on a clean surface. Even deciding how many underlying dimensions the data have can be challenging.

Tags: Python, Basic Programming, Linear Algebra, Calculus, Statistics, Machine Learning, Optimization, All Years

This project asks: Why does sparse regression often work well in practice, even when the usual theoretical assumptions do not clearly apply?

We will study this question using ideas from geometry, statistics, and optimization. Here, geometry means thinking about variables as directions or points in space. For example, two variables that contain almost the same information can be viewed as pointing in nearly the same direction. This viewpoint may help us understand when sparse regression makes reliable predictions, when it selects meaningful variables, and when its answer is unstable.
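
The "variables as directions" picture can be made concrete with a small sketch. It is only an illustration under stated assumptions, not the project's method: the synthetic variables `x1`, `x2`, `x3` and the helpers `center` and `cosine` are made up for this example. After centering, the cosine of the angle between two variables equals their sample correlation, so a near-duplicate variable points in almost the same direction as the original.

```python
import math
import random

def center(v):
    """Subtract the mean so the vector measures variation around the average."""
    mean = sum(v) / len(v)
    return [a - mean for a in v]

def cosine(u, v):
    """Cosine of the angle between two variables viewed as vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 200

# Hypothetical data: x2 is x1 plus a little noise, x3 is unrelated.
x1 = [rng.gauss(0, 1) for _ in range(n)]
x2 = [a + 0.05 * rng.gauss(0, 1) for a in x1]
x3 = [rng.gauss(0, 1) for _ in range(n)]

near = cosine(center(x1), center(x2))  # close to 1: nearly the same direction
far = cosine(center(x1), center(x3))   # close to 0: nearly orthogonal

print(f"x1 vs x2: {near:.3f}")
print(f"x1 vs x3: {far:.3f}")
```

When two columns are as closely aligned as `x1` and `x2`, a sparse method has little basis for preferring one over the other, which is one way the selected variables can become unstable.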

Tags: Python, Basic Programming, Linear Algebra, Statistics, Calculus, Optimization, Machine Learning, All Years

In this project, we will explore how machine learning can help astronomers find and study interesting objects or events. For example, a model might be used to classify astronomical objects, identify unusual observations, detect rare events, study populations of galaxies or galaxy clusters, or uncover patterns in the shape and organization of these systems. It may also help researchers understand the different stages or components of events such as gamma-ray bursts. The exact scientific question will depend on the available datasets and on discussions with collaborators in astronomy and cosmology. In the medium or longer term, there are opportunities to collaborate with astrophysicists and cosmologists at institutes such as the Perimeter Institute and the Vera Rubin Observatory.


Tags: Python, Basic Programming, Data Structures, Algorithms, Statistics, Linear Algebra, Calculus, Machine Learning, Astronomy, All Years

LLM-based agents are now used to write tests and fix bugs in real software. They sometimes succeed, but when they fail we usually have no idea why. Every agent leaves a full step-by-step log of what it did, called a trajectory. Many of these logs are now public, but almost no one has sat down and studied them carefully. This project analyzes those logs to understand how agents actually go about generating tests, where they get stuck, and what makes some tasks harder than others. This matters because developers are starting to trust these tools with real work. If we understand how and when they fail, we can build better tools and know when their output needs a second look. Recent public benchmarks like SWT Bench and SWE Atlas, built on real open-source projects, release these trajectories openly, so the data is ready to use.


Tags: Basic Programming, Python, Artificial Intelligence, All Years

This project explores how AI can help understand bilingual doctor–patient conversations and automatically generate accurate medical documentation. It has the potential to improve healthcare accessibility and reduce documentation workload for clinicians serving multilingual populations. We have already built a 280-hour speech corpus containing code-switched Kazakh-Russian medical data. We are now collecting an additional 100 hours of simulated doctor and patient conversations to improve model performance.

Tags: Basic Programming, Python, Artificial Intelligence, Machine Learning, Natural Language Processing, Data Science, All Years

This project aims to enhance a research platform for creating and analyzing interactive, web-based data visualization studies by adding an eye-tracking analysis toolkit. Eye-tracking can help researchers understand where users focus, how they analyze problems, and how they make decisions while interacting with websites and data visualizations. However, analyzing gaze data often requires expensive commercial software. This project aims to address that challenge by developing an open and accessible toolkit for analyzing common gaze measures from recorded user studies. By simplifying gaze analysis, the toolkit could support the development of adaptive visualization systems that respond to users’ needs and difficulties.


Tags: Human Computer Interaction (HCI), Python, React, All Years

In this project, students will evaluate a recently proposed content-defined chunking (CDC) algorithm and compare it against state-of-the-art techniques. Working in teams, students will use open-source implementations and real-world datasets to conduct benchmarking experiments and measure metrics such as chunking throughput, chunk-size distributions, and deduplication efficiency. Team members will focus on complementary tasks, including experiment design, dataset preparation, benchmarking, data analysis, and visualization. The primary goal during the program is to evaluate the algorithm and understand its strengths and limitations. Students who continue beyond the program may also explore integrating the algorithm into an open-source benchmarking framework and investigating further improvements to chunking techniques.
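
For readers new to content-defined chunking (CDC), the sketch below shows the basic idea with a Gear-style rolling hash. It is a toy illustration, not the algorithm under evaluation: the function `cdc_chunks`, the `GEAR` table, and all parameter values are assumptions chosen for this example. A chunk boundary is placed wherever selected bits of the rolling hash are zero, so boundaries follow the content rather than fixed offsets.

```python
import random

# Hypothetical table of random 64-bit values, one per possible byte value.
_table_rng = random.Random(42)
GEAR = [_table_rng.getrandbits(64) for _ in range(256)]

def cdc_chunks(data, mask_bits=6, min_size=32, max_size=256):
    """Split `data` (bytes) at positions chosen by a rolling hash of the content."""
    # Test bits 16..(16 + mask_bits - 1) of the hash. With the defaults they
    # depend only on the last 22 bytes, which is fewer than min_size.
    mask = ((1 << mask_bits) - 1) << 16
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        # Gear-style update: shift out old bytes, mix in the new one.
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFFFFFFFFFF
        length = i - start + 1
        if length < min_size:
            continue  # never cut before the minimum chunk size
        if (h & mask) == 0 or length >= max_size:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])  # trailing partial chunk
    return chunks

# Synthetic input: random bytes, then the same bytes with 3 bytes prepended.
data_rng = random.Random(0)
original = bytes(data_rng.getrandbits(8) for _ in range(8192))
edited = b"NEW" + original

a = cdc_chunks(original)
b = cdc_chunks(edited)

# Crude deduplication measure: share of the edited file's chunks already stored.
known = set(a)
reuse = sum(1 for chunk in b if chunk in known) / len(b)
print(f"{len(a)} chunks originally, {len(b)} after the edit, reuse = {reuse:.2f}")
```

With fixed-size chunking, prepending three bytes would shift every later boundary and almost no chunk would match. Here the boundaries realign with the content after the edit, which is the property that deduplication efficiency measures.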


Tags: Systems, Algorithms, C/C++, Go, Python, All Years

AI coding agents can attempt real compiler work, but they stumble on implementing optimizations: asked to add a rewrite rule to LLVM's InstCombine pass, they often produce patches that miscompile programs, break tests, or land in the wrong place, and our benchmarking shows agents fail many such tasks. The open question is what feedback closes the gap: when the agent is handed a correctness counterexample, a profitability estimate, or a regression result, does its success rate improve, and which helps most? This project answers that on a fixed open model in a fully observable loop.


Tags: Compilers, Artificial Intelligence, Python, Command Line, C/C++, 2nd Year +, Experienced 1st Years

When a compiler crashes, the program that triggered it is often thousands of lines long, yet almost none of them matter to the failure. "Program reduction" tools automatically shrink such inputs to a tiny reproducing example by repeatedly deleting pieces and re-testing. The major algorithms (Delta Debugging, Hierarchical Delta Debugging, Perses, ProbDD) are closely related and even reuse one another, but each is its own separate program, so they are hard to compare head-to-head or mix and match. This project builds a clean open-source framework where the candidate-generation strategy and the inner reduction algorithm are each a swappable plug-in. With every algorithm running on one shared engine, they can be compared on equal footing and recombined in new ways.
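
The "delete pieces and re-test" loop can be sketched with a simplified form of Delta Debugging (ddmin). This is a minimal sketch, not the project's framework: the function name `ddmin`, the toy program, and the `crashes` oracle are hypothetical. In a real reducer the oracle would run the compiler on the candidate and check for the same crash.

```python
import math

def ddmin(items, fails):
    """Shrink `items` to a sublist that still makes `fails` return True
    and from which no single element can be removed."""
    assert fails(items), "the starting input must reproduce the failure"
    n = 2  # current number of pieces to split into
    while len(items) >= 2:
        size = math.ceil(len(items) / n)
        pieces = [items[i:i + size] for i in range(0, len(items), size)]
        reduced = False
        # First try keeping just one piece.
        for piece in pieces:
            if fails(piece):
                items, n, reduced = piece, 2, True
                break
        # Otherwise try deleting one piece and keeping the rest.
        if not reduced:
            for skip in range(len(pieces)):
                rest = [x for j, p in enumerate(pieces) if j != skip for x in p]
                if fails(rest):
                    items, n, reduced = rest, max(n - 1, 2), True
                    break
        # Nothing worked: split more finely, or stop at single elements.
        if not reduced:
            if n >= len(items):
                break
            n = min(2 * n, len(items))
    return items

# Toy input: eight lines, of which only two matter to the "crash".
program = [
    "int a = 1;",
    "int b = 2;",
    "int y = 0;",
    "int c = a + b;",
    "print(c);",
    "int z = 1 / y;",
    "print(b);",
    "return 0;",
]

def crashes(lines):
    """Stand-in oracle: the failure needs both of these lines to be present."""
    return "int y = 0;" in lines and "int z = 1 / y;" in lines

print(ddmin(program, crashes))
# → ['int y = 0;', 'int z = 1 / y;']
```

Here the search strategy and the oracle are already separate arguments, which hints at the kind of plug-in boundary the framework aims for.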

Tags: Compilers, Python, Data Structures, Rust, 2nd Year +, Experienced 1st Years

Every time you compile a C or C++ program, the compiler quietly rewrites your code thousands of times to make it faster, e.g. "x + 0 -> x". In LLVM (behind Clang, Swift, and Rust), one pass called InstCombine performs an enormous share of these rewrites. We have built an open-source tool, instcombine-debugger, that patches LLVM to record every transformation InstCombine performs. This project extends that tool to capture richer traces, turning an opaque, heavily-used optimizer into something we can observe and understand.
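
To make "record every transformation" concrete, here is a toy rewriter that applies rules such as "x + 0 -> x" and logs each one. It is an assumption-laden sketch: it does not use LLVM or the instcombine-debugger, and the tuple-based expression format, the `RULES` table, and `simplify` are invented for this example.

```python
# Expressions are variable names, integer constants, or (op, lhs, rhs) tuples.
# Each hypothetical rule is (name, predicate, rewrite).
RULES = [
    ("add-zero", lambda e: e[0] == "add" and e[2] == 0, lambda e: e[1]),
    ("mul-one", lambda e: e[0] == "mul" and e[2] == 1, lambda e: e[1]),
]

def simplify(expr, trace):
    """Rewrite `expr` bottom-up, appending (rule, before, after) to `trace`."""
    if not isinstance(expr, tuple):
        return expr  # a variable or constant: nothing to rewrite
    op, lhs, rhs = expr
    node = (op, simplify(lhs, trace), simplify(rhs, trace))
    for name, matches, rewrite in RULES:
        if matches(node):
            result = rewrite(node)
            trace.append((name, node, result))
            # The result may enable further rewrites, so simplify it again.
            return simplify(result, trace)
    return node

trace = []
result = simplify(("mul", ("add", "x", 0), 1), trace)

print(result)
for name, before, after in trace:
    print(f"{name}: {before} -> {after}")
```

The trace lists the rewrites in the order they fired, which is the kind of record that makes an optimizer's behavior observable.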

Tags: Compilers, Python, Command Line, C/C++, 2nd Year +, Experienced 1st Years