🏆 Distinguished Paper Award — FORGE 2026
When cloud services fail, identifying the root cause requires tracing faults through layers of interdependent microservices—a task increasingly delegated to LLM-based agents. This project presents a controlled empirical evaluation of six open-source LLMs across 48,000 simulated fault scenarios, isolating their reasoning behavior from the complexity of multi-agent pipelines. We introduce a taxonomy of 16 reasoning failure types, automatically annotated at scale using an LLM-as-a-Judge evaluator, and show that specific failures (e.g., anchoring bias and arbitrary evidence selection) are strong predictors of diagnostic error.
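
To make the annotation step concrete, below is a minimal sketch of how an LLM-as-a-Judge pass might label a single reasoning trace against the failure taxonomy. This is illustrative, not the paper's implementation: the `judge` callable, the prompt wording, and the snake_case category names are assumptions, and only the two failure types named above are listed (the full taxonomy has 16).

```python
import json
from typing import Callable

# Illustrative subset of the failure taxonomy. The paper defines 16 types;
# only the two named in the abstract are spelled out here.
FAILURE_TYPES = [
    "anchoring_bias",
    "arbitrary_evidence_selection",
    # ... remaining 14 taxonomy categories
]

JUDGE_PROMPT = """\
You are grading the reasoning trace of a fault-diagnosis agent.

Fault scenario:
{scenario}

Agent reasoning trace:
{trace}

For each failure type in {types}, decide whether it occurs in the trace.
Respond with a JSON object mapping each failure type to true or false.
"""

def annotate_trace(
    scenario: str,
    trace: str,
    judge: Callable[[str], str],  # judge(prompt) -> raw model completion
) -> dict[str, bool]:
    """Ask an LLM judge which reasoning failures a trace exhibits."""
    prompt = JUDGE_PROMPT.format(
        scenario=scenario, trace=trace, types=FAILURE_TYPES
    )
    raw = judge(prompt)
    labels = json.loads(raw)
    # Keep only known taxonomy keys; default missing ones to False.
    return {t: bool(labels.get(t, False)) for t in FAILURE_TYPES}
```

A caller would wrap their model client in `judge` (any function that sends the prompt to a model and returns the completion text). Aggregating these per-trace labels across the simulated scenarios is what allows failure frequencies to be correlated with diagnostic error.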