From Program Slices to Causal Clarity: Evaluating Faithful, Actionable LLM-Generated Failure Explanations via Context Partitioning and LLM-as-a-Judge

Julius Porbeck; Christian Medeiros Adriano; Holger Giese (Hasso Plattner Institute; University of Potsdam; Germany)

arXiv:2604.18309 · cs.SE · May 21, 2026

TL;DR

This paper systematically studies how different context configurations affect the quality of LLM-generated failure explanations in debugging, emphasizing the importance of targeted, evidence-rich information for causal clarity.
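
To make the idea of a "context configuration" concrete, here is a minimal sketch that enumerates subsets of failure artifacts and builds one prompt per subset. The artifact names (code, test, error) follow the examples given in the abstract; the artifact contents, the prompt wording, and the helper names `context_configurations` and `build_prompt` are hypothetical and are not taken from the paper.

```python
from itertools import combinations

# Hypothetical toy artifacts; the real study's artifacts and partitioning are not shown here.
ARTIFACTS = {
    "code": "def div(a, b):\n    return a / b",
    "test": "assert div(1, 0) == 0",
    "error": "ZeroDivisionError: division by zero",
}


def context_configurations(artifacts):
    """Yield every non-empty subset of artifact names, smallest subsets first."""
    names = sorted(artifacts)
    for size in range(1, len(names) + 1):
        for subset in combinations(names, size):
            yield subset


def build_prompt(artifacts, subset):
    """Assemble a prompt that contains only the selected artifacts."""
    sections = [f"### {name}\n{artifacts[name]}" for name in subset]
    return "Explain why this failure occurs.\n\n" + "\n\n".join(sections)


configs = list(context_configurations(ARTIFACTS))
prompts = {subset: build_prompt(ARTIFACTS, subset) for subset in configs}
```

Each entry in `prompts` could then be sent to a model and the resulting explanations compared, which is one simple way to isolate the effect of individual artifacts on explanation quality.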

Contribution

The paper introduces a systematic evaluation of how context affects explanation quality and validates LLM-as-a-judge scores against human ratings across multiple models and configurations.
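
One common way to check automated judge scores against human ratings is a rank correlation. The sketch below computes Spearman's rank correlation in plain Python. It is an illustration only: this summary does not say which agreement measure the paper uses, and the score lists are made-up toy values, not results from the study.

```python
def ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    std_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the rank vectors."""
    return pearson(ranks(x), ranks(y))


# Hypothetical toy scores for five explanations (not data from the paper).
judge_scores = [4, 2, 5, 3, 1]
human_ratings = [5, 2, 4, 3, 1]
rho = spearman(judge_scores, human_ratings)
```

A value of `rho` near 1 would indicate that the judge orders explanations much as the human raters do; in practice a library routine such as `scipy.stats.spearmanr` would usually replace the hand-written version.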

Findings

01. Evidence-rich, failure-specific artifacts improve explanation quality.

02. Overly large contexts tend to produce vague explanations.

03. Higher explanation scores correlate with better downstream bug fixes.

Abstract

Large language model (LLM)-based debugging systems can generate failure explanations, but these explanations may be incomplete or incorrect. Misleading explanations are harmful for downstream tasks (e.g., bug triage, bug fixing). We investigate how explanation quality is affected by various LLM context configurations. Existing work predominantly treats LLM-generated failure explanations as an ad hoc by-product of debugging or repair workflows, using generic prompting over undifferentiated artifacts such as code, tests, and error messages rather than targeting explanations as a first-class output with dedicated quality assessment. Consequently, existing approaches provide limited support for assessing whether these explanations capture the underlying fault-error-failure mechanism and for actionable next steps, and most techniques instead prioritize task success (e.g., patch correctness…
