LLM-Based Automated Diagnosis Of Integration Test Failures At Google

Celal Ziftci; Ray Liu; Spencer Greene; Livio Dalloro

arXiv:2604.12108 · cs.SE · April 15, 2026

TL;DR

Auto-Diagnose uses large language models to analyze failure logs and diagnose the root causes of integration test failures at Google, and it surfaces its findings inside the code review workflow developers already use.

Contribution

The paper introduces Auto-Diagnose, an LLM-based tool that automates root cause analysis of integration test failures. The tool is integrated into Critique, Google's internal code review system, and the authors report high diagnostic accuracy and positive user feedback.

Findings

01. 90.14% accuracy in diagnosing root causes

02. Used on 52,635 failing tests at Google

03. Rated 'Not helpful' in only 5.8% of cases

Abstract

Integration testing is critical for the quality and reliability of complex software systems. However, diagnosing integration test failures presents significant challenges due to the massive volume, unstructured nature, and heterogeneity of the logs these tests generate. These factors result in a high cognitive load and a low signal-to-noise ratio, making diagnosis difficult and time-consuming. Developers complain about these difficulties consistently and report spending substantially more time diagnosing integration test failures than unit test failures. To address these shortcomings, we introduce Auto-Diagnose, a novel diagnosis tool that leverages LLMs to help developers efficiently determine the root cause of integration test failures. Auto-Diagnose analyzes failure logs, produces concise summaries with the most relevant log lines, and is integrated into Critique, Google's internal code review system,…
