Projects - search

Understanding this hidden structure could help us visualize data, remove noise, compare examples, and build machine-learning systems that are faster, more reliable, and easier to understand.

In this project, we will try to answer: When can we discover the hidden shape of data accurately and efficiently?

This is a difficult problem. In the most general setting, learning the full shape may require a very large amount of data and computation. Real data are also noisy, so observations may not lie exactly on a clean surface. Even deciding how many underlying dimensions the data have can be challenging.

Tags: Python, Basic Programming, Linear Algebra, Calculus, Statistics, Machine Learning, Optimization, All Years

This project asks: Why does sparse regression often work well in practice, even when the usual theoretical assumptions do not clearly apply?

We will study this question using ideas from geometry, statistics, and optimization. Here, geometry means thinking about variables as directions or points in space. For example, two variables that contain almost the same information can be viewed as pointing in nearly the same direction. This viewpoint may help us understand when sparse regression makes reliable predictions, when it selects meaningful variables, and when its answer is unstable.
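As a small illustration of the geometric viewpoint, the sketch below treats each centered variable as a vector and measures the cosine of the angle between two of them. The toy columns (`height_cm`, `height_in`, `unrelated`) are invented for this example and are not project data.

```python
import math

def center(column):
    """Subtract the mean so the column is a direction around the origin."""
    mean = sum(column) / len(column)
    return [x - mean for x in column]

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy data: the same heights in two units (with rounding noise),
# plus a variable that carries different information.
height_cm = [150.0, 160.0, 170.0, 180.0, 190.0]
height_in = [59.1, 62.9, 67.0, 70.8, 74.9]
unrelated = [3.0, -1.0, 2.0, -2.0, 1.0]

near_duplicate = cosine(center(height_cm), center(height_in))
different = cosine(center(height_cm), center(unrelated))
print(f"cm vs inches:    {near_duplicate:.4f}")
print(f"cm vs unrelated: {different:.4f}")
```

A cosine close to 1 means the two variables point in nearly the same direction, which is the situation where a sparse method may pick either one and its selection can become unstable.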

Tags: Python, Basic Programming, Linear Algebra, Statistics, Calculus, Optimization, Machine Learning, All Years

In this project, we will explore how machine learning can help astronomers find and study interesting objects or events. For example, a model might be used to classify astronomical objects, identify unusual observations, detect rare events, study populations of galaxies or galaxy clusters, or uncover patterns in the shape and organization of these systems. It may also help researchers understand the different stages or components of events such as gamma-ray bursts. The exact scientific question will depend on the available datasets and discussions with collaborators in astronomy and cosmology. There are opportunities to collaborate with astrophysicists and cosmologists at institutions such as the Perimeter Institute and the Vera Rubin Observatory in the medium or longer term.

Tags: Python, Basic Programming, Data Structures, Algorithms, Statistics, Linear Algebra, Calculus, Machine Learning, Astronomy, All Years

This project asks: can we use visual design to let people navigate information at their own depth? The core idea is progressive disclosure through visual cues: using symbols, icons, and glyphs to signal that more detail exists, and revealing that detail only when someone expresses interest (by clicking, hovering, or zooming in). Think of it like a map: at a distance, you see city names; as you zoom in, streets appear; closer still, individual buildings. We want to apply that same logic to arbitrary information.

Tags: Basic Programming, Human Computer Interaction (HCI), Visualization, 3rd Year+

Attention-deficit/hyperactivity disorder (ADHD) affects an estimated 5–10% of children worldwide. Yet existing interventions — medication and clinic-based therapy — remain costly and difficult to access for many families. Neurofeedback training is a non-pharmacological approach with a growing evidence base, but it is currently available almost exclusively in clinical settings.

Our research asks: What should an at-home attention training system look like for families of children with ADHD? We are designing a system that combines an EEG headset, tangible interactive hardware, and gamified training experiences — one that children actually want to use, that parents can meaningfully participate in, and that makes training progress visible and trackable.

Tags: Basic Programming, Figma, Human Computer Interaction (HCI), Psychology, 2nd Year +

Recent experiments have revealed surprisingly large performance variation across repeated executions of some applications, even after taking standard benchmarking precautions. One possible explanation is that address space layout randomization (ASLR) produces memory layouts with significantly different performance characteristics. If so, an important challenge is determining how these layouts differ and identifying the memory-layout properties responsible for the observed performance changes.

A possible research direction is to develop techniques and tools for detecting ASLR-induced performance variation, comparing memory layouts across executions, and identifying the characteristics that distinguish faster and slower runs. Such a tool could potentially build upon HeapLENS and leverage AI-assisted analysis to help explain observed performance differences.

Tags: C/C++, Data Structures, Multithreading, Memory Management, Operating Systems, Systems, 2nd Year +

Professor T. Brown recently developed a system called HeapLENS to help researchers automatically examine the memory layout of multithreaded applications. HeapLENS is specifically designed to produce compact, high-quality, curated output suitable for AI-driven analysis. While HeapLENS output can already enable AI agents to improve application memory layouts by a significant margin, the current workflow invokes HeapLENS only once and uses its output only once. A natural research direction is therefore to adapt HeapLENS to support repeated interaction with an AI agent, enabling an iterative optimization cycle in which incremental changes can be proposed, evaluated, and refined.

Tags: C/C++, Data Structures, Multithreading, Memory Management, Operating Systems, Systems, Artificial Intelligence, 2nd Year +

Professor T. Brown and collaborators recently designed a concurrent version of the van Emde Boas tree that incorporates a number of novel space optimizations and can outperform other state-of-the-art concurrent ordered sets by a large margin. However, this data structure relies on hardware transactional memory (HTM) for synchronization. The goal of this project is to extend this work to universally available synchronization mechanisms for systems without HTM support, with optimistic concurrency control (OCC) being one natural direction.

Tags: C/C++, Data Structures, Multithreading, Systems, 2nd Year +

LLM-based agents are now used to write tests and fix bugs in real software. They sometimes succeed, but when they fail, we usually have no idea why. Every agent leaves a full step-by-step log of what it did, called a trajectory. Many of these logs are now public, but almost no one has sat down and studied them carefully. This project analyzes those logs to understand how agents actually go about generating tests, where they get stuck, and what makes some tasks harder than others. This matters because developers are starting to trust these tools with real work. If we understand how and when they fail, we can build better tools and know when their output needs a second look. Recent public benchmarks like SWT Bench and SWE Atlas, built on real open-source projects, release these trajectories openly, so the data is ready to use.

Tags: Basic Programming, Python, Artificial Intelligence, All Years

This project explores how AI can help understand bilingual doctor–patient conversations and automatically generate accurate medical documentation. It has the potential to improve healthcare accessibility and reduce documentation workload for clinicians serving multilingual populations. We have already built a 280-hour speech corpus of code-switched Kazakh-Russian medical data. We are now collecting an additional 100 hours of simulated doctor–patient conversations to improve model performance.

Tags: Basic Programming, Python, Artificial Intelligence, Machine Learning, Natural Language Processing, Data Science, All Years

Modern AI and machine learning systems are increasingly trained and deployed on distributed infrastructures consisting of multiple servers working together. While distributed computing enables larger models and faster processing, it also introduces new security challenges. Communication between nodes, shared resources, and distributed coordination mechanisms can create vulnerabilities that may not exist in single-machine systems. The goal of this project is to understand and evaluate security risks that arise when training or running AI/ML models in distributed environments. By identifying and studying these vulnerabilities, we can help build more secure and trustworthy AI systems.

Tags: Networks, Operating Systems, Artificial Intelligence, Machine Learning, Security, Systems, All Years

For secure multiparty computation (MPC), our goal is for parties 1 to n to securely compute f(x1, …, xn) where xi is the private input of party i. Our security condition is for the messages each party sends and receives during the computation of f to reveal no more information than its input and output. This allows the parties to collaboratively compute a function over their private inputs while maintaining privacy.
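To make the setting concrete, here is a minimal sketch for one specific choice of f, the sum of all inputs, using additive secret sharing. It assumes honest-but-curious parties and simulates all of them inside one process; `MODULUS`, `share`, and `secure_sum` are names invented for this illustration, not part of the project.

```python
import secrets

MODULUS = 2**61 - 1  # assumed modulus; inputs and their sum must stay below it

def share(value, n_parties):
    """Split value into n_parties random shares that add up to it mod MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % MODULUS)
    return shares

def secure_sum(private_inputs):
    """Simulate n parties computing f(x1, ..., xn) = x1 + ... + xn."""
    n = len(private_inputs)
    # Party i splits x_i and would send share j to party j.
    dealt = [share(x, n) for x in private_inputs]
    # Party j adds up the shares it received; each partial sum looks random.
    partial_sums = [sum(dealt[i][j] for i in range(n)) % MODULUS for j in range(n)]
    # Publishing only the partial sums reveals the total.
    return sum(partial_sums) % MODULUS

print(secure_sum([3, 5, 10]))
```

Each individual share is uniformly random, so a single message reveals nothing about the input it came from. Only the final output is learned by everyone.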

Traditionally, MPC algorithms have a fixed runtime that depends only on input size rather than the specific input, since otherwise the runtime would leak information about the private input. However, for non-private algorithms, there are practical algorithms with a runtime that is both random and low in expectation. One example that has been successfully adapted to the MPC setting is quicksort, whose random runtime is independent of the input list. Our goal in this project is to adapt another algorithm whose random runtime is independent of the specific input and benchmark it against private deterministic versions of the same algorithm. A successful implementation could enable adaptation of richer algorithm classes to the private setting.

Tags: Algorithms, Statistics, Security, All Years

Many non-private implementations of algorithms access data structures at indices determined at runtime. Because these indices depend on the input, revealing them would compromise privacy under the security definition of secure multiparty computation (MPC). While there are asymptotically efficient solutions to adapt these algorithms to the MPC model, these solutions use generic constructions, and the constant factors make using them impractical.
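The simplest way to hide an index is to touch every element on every access. The sketch below shows that naive baseline in plain Python. It is not one of the asymptotically efficient constructions mentioned above, and in a real MPC protocol the comparison and multiplication would run on secret-shared values. The function name and the optional `trace` argument are invented for this illustration.

```python
def oblivious_read(array, secret_index, trace=None):
    """Read array[secret_index] while visiting every position in the same order."""
    result = 0
    for position, value in enumerate(array):
        if trace is not None:
            trace.append(position)  # record the visible access pattern
        is_match = int(position == secret_index)  # 1 at the secret index, else 0
        result += is_match * value  # arithmetic select instead of a branch
    return result

trace_a, trace_b = [], []
print(oblivious_read([10, 20, 30, 40], 2, trace_a))
print(oblivious_read([10, 20, 30, 40], 0, trace_b))
print(trace_a == trace_b)  # the access pattern does not depend on the index
```

The cost is one full pass per access, which illustrates why more practical oblivious data structures are worth studying.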

Tags: Data Structures, Algorithms, Security, All Years

One primitive used to implement MPC algorithms is function secret sharing, which is a way to split a function f among multiple parties such that each party can evaluate f on a common input x and obtain shares of the output f(x). We investigate the use of function secret sharing to implement sorting algorithms in MPC since sorting is a common subroutine in many algorithms. We then benchmark these implementations against state-of-the-art private sorting algorithms.
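As a rough illustration of the definition, the sketch below shares a point function (equal to `beta` at `alpha` and 0 elsewhere) between two parties by additively sharing its whole truth table. This is a deliberately naive scheme with keys as large as the domain; practical function secret sharing aims for much shorter keys. All names here are invented for the example.

```python
import secrets

MODULUS = 2**32  # assumed output group: integers mod 2^32

def share_point_function(alpha, beta, domain_size):
    """Return two keys whose evaluations add up to f(x) = beta if x == alpha else 0."""
    table = [beta if x == alpha else 0 for x in range(domain_size)]
    key0 = [secrets.randbelow(MODULUS) for _ in range(domain_size)]
    key1 = [(t - r) % MODULUS for t, r in zip(table, key0)]
    return key0, key1

def evaluate(key, x):
    """Each party evaluates its own key on the common input x."""
    return key[x]

key0, key1 = share_point_function(alpha=3, beta=7, domain_size=8)
for x in range(8):
    output = (evaluate(key0, x) + evaluate(key1, x)) % MODULUS
    print(x, output)
```

Each key on its own is a table of uniformly random values, so neither party learns `alpha` or `beta` from its key alone.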

Tags: Algorithms, Cryptography, C/C++, Security, All Years

Healthcare data can reveal important insights that improve patient care, but analyzing it is challenging. Analysts must explore complex datasets, generate and test hypotheses, and interpret results carefully. While Generative AI can assist by creating code, visualizations, and insights, it does not always understand users' goals and can sometimes produce unreliable results. This project explores how teams of AI agents can collaborate with humans to support healthcare data analysis. We will design new interaction techniques that help people communicate their intent, understand how AI-generated results were produced, and assess whether those results are trustworthy. By making human-AI collaboration more transparent and reliable, this research aims to help healthcare professionals gain insights from data more effectively and make better-informed decisions.

Tags: Web Development, Data Analysis, Human Computer Interaction (HCI), Artificial Intelligence, All Years

This project aims to enhance a research platform for creating and analyzing interactive, web-based data visualization studies by adding an eye-tracking analysis toolkit. Eye-tracking can help researchers understand where users focus, how they analyze problems, and how they make decisions while interacting with websites and data visualizations. However, analyzing gaze data often requires expensive commercial software. This project aims to address that challenge by developing an open and accessible toolkit for analyzing common gaze measures from recorded user studies. By simplifying gaze analysis, the toolkit could support the development of adaptive visualization systems that respond to users’ needs and difficulties.

Tags: Human Computer Interaction (HCI), Python, React, All Years

In this project, students will evaluate a recently proposed content-defined chunking (CDC) algorithm and compare it against state-of-the-art techniques. Working in teams, students will use open-source implementations and real-world datasets to conduct benchmarking experiments and measure metrics such as chunking throughput, chunk-size distributions, and deduplication efficiency. Team members will focus on complementary tasks, including experiment design, dataset preparation, benchmarking, data analysis, and visualization. The primary goal during the program is to evaluate the algorithm and understand its strengths and limitations. Students who continue beyond the program may also explore integrating the algorithm into an open-source benchmarking framework and investigating further improvements to chunking techniques.
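For readers new to content-defined chunking, here is a minimal sketch of the general idea using a gear-style rolling hash: a chunk ends wherever the hash of the most recent bytes matches a bit pattern, so boundaries follow the content and not fixed offsets. This is not the algorithm under evaluation; the table seed, mask, and size limits are arbitrary assumptions chosen for the example.

```python
import random

_rng = random.Random(0)  # fixed seed so the table is reproducible
GEAR = [_rng.getrandbits(32) for _ in range(256)]  # one random value per byte
MASK = 0xFC000000  # top 6 bits must be zero: roughly 1 in 64 positions
MIN_SIZE = 16      # assumed minimum chunk length in bytes
MAX_SIZE = 256     # assumed maximum chunk length in bytes

def chunk(data):
    """Split bytes into variable-size chunks at content-defined cut points."""
    chunks = []
    start = 0
    fingerprint = 0
    for i, byte in enumerate(data):
        # Shifting drops old bytes, so the hash depends on the last 32 bytes.
        fingerprint = ((fingerprint << 1) + GEAR[byte]) & 0xFFFFFFFF
        length = i - start + 1
        at_cut_point = length >= MIN_SIZE and (fingerprint & MASK) == 0
        if at_cut_point or length >= MAX_SIZE:
            chunks.append(data[start:i + 1])
            start = i + 1
            fingerprint = 0
    if start < len(data):
        chunks.append(data[start:])  # trailing partial chunk
    return chunks

data = bytes(random.Random(1).getrandbits(8) for _ in range(5000))
original = chunk(data)
shifted = chunk(b"X" + data)  # insert one byte at the front
shared = set(original) & set(shifted)
print(len(original), len(shifted), len(shared))
```

With fixed-size chunks, a one-byte insertion would shift every later boundary. Here the boundaries are expected to realign after the edit, so most chunks can still be deduplicated.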

Tags: Systems, Algorithms, C/C++, Go, Python, All Years

AI coding agents can attempt real compiler work, but they stumble on implementing optimizations: asked to add a rewrite rule to LLVM's InstCombine pass, they often produce patches that miscompile programs, break tests, or land in the wrong place, and our benchmarking shows agents fail many such tasks. The open question is what feedback closes the gap: when the agent is handed a correctness counterexample, a profitability estimate, or a regression result, does its success rate improve, and which helps most? This project answers that on a fixed open model in a fully observable loop.

Tags: Compilers, Artificial Intelligence, Python, Command Line, C/C++, 2nd Year +, Experienced 1st Years

When a compiler crashes, the program that triggered it is often thousands of lines long, yet almost none of them matter to the failure. "Program reduction" tools automatically shrink such inputs to a tiny reproducing example by repeatedly deleting pieces and re-testing. The major algorithms (Delta Debugging, Hierarchical Delta Debugging, Perses, ProbDD) are closely related and even reuse one another, but each is its own separate program, so they are hard to compare head-to-head or mix and match. This project builds a clean open-source framework where the candidate-generation strategy and the inner reduction algorithm are each a swappable plug-in. With every algorithm running on one shared engine, they can be compared on equal footing and recombined in new ways.
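The delete-and-retest loop can be sketched in a few lines. The version below is a simplified, complement-only variant of Delta Debugging written for this illustration; it is not the project's framework, and the `still_fails` predicate and the toy input are hypothetical stand-ins for running a real compiler.

```python
def ddmin(items, still_fails):
    """Shrink items while still_fails(items) stays True (simplified ddmin)."""
    assert still_fails(items), "the starting input must reproduce the failure"
    granularity = 2
    while len(items) >= 2:
        chunk_size = max(1, len(items) // granularity)
        reduced = False
        for start in range(0, len(items), chunk_size):
            # Candidate: the input with one chunk deleted.
            candidate = items[:start] + items[start + chunk_size:]
            if still_fails(candidate):
                items = candidate
                granularity = max(granularity - 1, 2)
                reduced = True
                break
        if not reduced:
            if chunk_size == 1:
                break  # no single element can be removed any more
            granularity = min(granularity * 2, len(items))  # try smaller chunks
    return items

# Hypothetical failure: the crash needs both of these lines to be present.
def still_fails(lines):
    return "enable_opt" in lines and "crash()" in lines

program = ["a", "enable_opt", "b", "c", "d", "crash()", "e", "f"]
print(ddmin(program, still_fails))  # → ['enable_opt', 'crash()']
```

Passing the predicate in as a function is one simple way to keep the reduction algorithm separate from how failures are detected.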

Tags: Compilers, Python, Data Structures, Rust, 2nd Year +, Experienced 1st Years

Every time you compile a C or C++ program, the compiler quietly rewrites your code thousands of times to make it faster, e.g. "x + 0 -> x". In LLVM (behind Clang, Swift, and Rust), one pass called InstCombine performs an enormous share of these rewrites. We have built an open-source tool, instcombine-debugger, that patches LLVM to record every transformation InstCombine performs. This project extends that tool to capture richer traces, turning an opaque, heavily-used optimizer into something we can observe and understand.
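To show what recording a rewrite can look like, here is a toy sketch that applies two algebraic identities to a small expression tree and logs each transformation. It does not use LLVM or the instcombine-debugger tool; the tuple-based expression format, the rule names, and the `simplify` function are all invented for this example.

```python
def simplify(expr, log):
    """Rewrite expr bottom-up, logging (rule, before, after) for each rewrite."""
    if not isinstance(expr, tuple):
        return expr  # a variable name or an integer constant
    op, lhs, rhs = expr
    lhs = simplify(lhs, log)
    rhs = simplify(rhs, log)
    rebuilt = (op, lhs, rhs)
    if op == "add" and rhs == 0:  # x + 0 -> x
        log.append(("add-zero", rebuilt, lhs))
        return lhs
    if op == "mul" and rhs == 1:  # x * 1 -> x
        log.append(("mul-one", rebuilt, lhs))
        return lhs
    return rebuilt

log = []
result = simplify(("mul", ("add", "x", 0), 1), log)
print(result)  # → x
for rule, before, after in log:
    print(rule, before, "->", after)
```

Even in this toy form, the log shows which rule fired and in what order, which is the kind of trace that makes an optimizer easier to inspect.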

Tags: Compilers, Python, Command Line, C/C++, 2nd Year +, Experienced 1st Years

Project 20 - Finding the Hidden Shape of Complex Data

Project 19 - Why Sparse Regression Works?

Project 18 - Discovering Patterns and Structures in the Universe with Machine Learning

Project 17 - Visual Progressive Disclosure for Information Overload

Project 16 - Designing Gamified Attention Training for Children with ADHD

Project 15 - AI-Driven Analysis of ASLR-Induced Performance Variation

Project 14 - Iterative AI-Driven Memory Analysis of Concurrent Data Structures

Project 13 - Practical Concurrent Ordered Sets

Project 12 - How do AI coding agents write tests, and when do they fail?

Project 11 - Building AI Systems that Understand Doctor–Patient Conversations

Project 10 - Security analysis of distributed AI/ML systems

Project 9 - Secure Algorithms with Random Runtime

Project 8 - Oblivious Data Structures for Secure Computation

Project 7 - Secure Sorting Using Functional Secret Sharing

Project 6 - Multi-Agent AI for Healthcare Data Sensemaking

Project 5 - Building a Gaze Analysis Toolkit for Accessible Web-Based Eye-Tracking Studies in ReVISitBench

Project 4 - Evaluating Content-Defined Chunking Algorithms for Efficient Deduplication Systems

Project 3 - An Agentic Harness for Implementing Missed Compiler Optimizations

Project 2 - A Unified Framework for Program Reduction Algorithms

Project 1 - A Compiler Optimization Observatory — Instrumenting LLVM at Scale