Project 12 - How do AI coding agents write tests, and when do they fail? | Women in Computer Science

Graduate mentor's supervisor: Prof. Mei Nagappan

LLM-based agents are now used to write tests and fix bugs in real software. They sometimes succeed, but when they fail, we usually have no idea why. Every agent run leaves a full step-by-step log of what it did, called a trajectory. Many of these logs are now public, but almost no one has studied them carefully. This project analyzes those logs to understand how agents actually go about generating tests, where they get stuck, and what makes some tasks harder than others. This matters because developers are starting to trust these tools with real work. If we understand how and when they fail, we can build better tools and know when their output needs a second look. Recent benchmarks such as SWT-Bench and SWE Atlas, built on real open-source projects, release these trajectories publicly, so the data is ready to use.

The project splits into three connected parts, so members of a team of 3-4 can each take the lead on one and share findings:

Short-term (this term): read a set of agent trajectories and tag what the agent does at each step, such as reading the bug report, opening files, running code, or retrying. Agree on a shared labeling guide so everyone tags the same way. Produce a first clear picture of common agent behaviors and the most common ways agents fail.
Medium-term: compare different agents on the same tasks and check whether they fail on the same bugs or different ones. This shows whether a failure comes from the task being hard or from the agent. Separately, compare easy and hard tasks and look at simple structural differences, such as how many files are involved or what kind of error shows up.
Longer-term (for students who continue past the program): turn the labeled trajectories into a paper and a taxonomy of how agents behave when writing tests.
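
To give a feel for the labeling and comparison work described above, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the label names, the record fields (`agent`, `task`, `passed`, `steps`), and the sample data are hypothetical, and the real label set would come from the team's shared labeling guide rather than from this example.

```python
from collections import Counter

# Hypothetical label set; the real one would be defined in the shared labeling guide.
LABELS = {"read_report", "open_file", "run_code", "retry"}

# Hypothetical hand-labeled trajectories. All values are made up for
# illustration and are not taken from any benchmark.
trajectories = [
    {"agent": "agent_a", "task": "task_1", "passed": True,
     "steps": ["read_report", "open_file", "run_code"]},
    {"agent": "agent_a", "task": "task_2", "passed": False,
     "steps": ["read_report", "run_code", "retry", "retry"]},
    {"agent": "agent_b", "task": "task_1", "passed": False,
     "steps": ["open_file", "run_code", "retry"]},
    {"agent": "agent_b", "task": "task_2", "passed": False,
     "steps": ["read_report", "open_file", "run_code", "retry"]},
]


def label_counts(trajs):
    """Count how often each step label appears across all trajectories."""
    counts = Counter()
    for t in trajs:
        # Reject labels that are not in the agreed guide, so typos surface early.
        unknown = set(t["steps"]) - LABELS
        if unknown:
            raise ValueError(f"labels not in the guide: {sorted(unknown)}")
        counts.update(t["steps"])
    return counts


def failed_tasks(trajs, agent):
    """Return the set of task ids on which the given agent failed."""
    return {t["task"] for t in trajs if t["agent"] == agent and not t["passed"]}


def failure_overlap(trajs, agent_x, agent_y):
    """Split failures into tasks both agents failed and tasks only one failed."""
    fx = failed_tasks(trajs, agent_x)
    fy = failed_tasks(trajs, agent_y)
    return {"both": fx & fy, "only_x": fx - fy, "only_y": fy - fx}


counts = label_counts(trajectories)
overlap = failure_overlap(trajectories, "agent_a", "agent_b")
print(counts.most_common())
print(overlap)
```

Tasks that land in `both` are candidates for "the task is hard", while tasks in `only_x` or `only_y` point toward something specific to one agent. With real data, the same tallies could be computed per agent or per task difficulty.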

This project is best suited for students who have taken one or two programming courses and are comfortable reading code and data files in Python. No research experience or AI background is needed. Most of the work is careful reading, organizing, and labeling, so first- and second-year students can do it.