Is Your Automated Software Engineer Trustworthy?

Noble Saji Mathews; Meiyappan Nagappan

arXiv:2506.17812·cs.SE·June 24, 2025

TL;DR

This paper introduces BouncerBench, a benchmark to evaluate whether LLM-based software agents can abstain from acting when inputs are vague or outputs are likely incorrect, highlighting current models' lack of trustworthiness.

Contribution

The paper presents BouncerBench, a novel benchmark for assessing the ability of LLM-based software agents to refuse actions under uncertainty, addressing a critical gap in trustworthiness evaluation.

Findings

01 Most models fail to abstain on vague inputs.

02 Models often generate incorrect code patches.

03 There is significant room for improvement in model trustworthiness.

Abstract

Large Language Models (LLMs) are being increasingly used in software engineering tasks, with an increased focus on bug report resolution over the past year. However, most proposed systems fail to properly handle uncertain or incorrect inputs and outputs. Existing LLM-based tools and coding agents respond to every issue and generate a patch for every case, even when the input is vague or their own output is incorrect. There are no mechanisms in place to abstain when confidence is low. This leads to unreliable behaviour, such as hallucinated code changes or responses based on vague issue reports. We introduce BouncerBench, a benchmark that evaluates whether LLM-based software agents can refuse to act when inputs are ill-defined or refuse to respond when their own outputs are likely to be incorrect. Unlike prior benchmarks that implicitly incentivize models to generate responses even when…
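
The two abstention points named in the abstract (refusing to act on an ill-defined input, and refusing to respond when the output is likely incorrect) can be pictured as two gates around patch generation. The following is only a minimal sketch under stated assumptions: `Outcome`, `bounced_resolve`, the scorer callables, and the 0.5 thresholds are hypothetical names and values chosen for illustration, not the paper's method and not a BouncerBench API.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Outcome:
    """Result of one attempt; patch is None when the agent abstained."""
    patch: Optional[str]
    reason: str


def bounced_resolve(
    issue: str,
    input_score: Callable[[str], float],        # hypothetical: how well-specified is the issue, in [0, 1]
    generate_patch: Callable[[str], str],       # hypothetical: the underlying patch generator
    output_score: Callable[[str, str], float],  # hypothetical: confidence that the patch is correct, in [0, 1]
    input_threshold: float = 0.5,               # assumed cutoff, not taken from the paper
    output_threshold: float = 0.5,              # assumed cutoff, not taken from the paper
) -> Outcome:
    # Input gate: refuse to act when the issue report is too vague.
    if input_score(issue) < input_threshold:
        return Outcome(None, "abstained: input too vague")

    patch = generate_patch(issue)

    # Output gate: refuse to respond when the generated patch is likely wrong.
    if output_score(issue, patch) < output_threshold:
        return Outcome(None, "abstained: low confidence in patch")

    return Outcome(patch, "patch submitted")


if __name__ == "__main__":
    # Toy stand-ins for the scorers and generator, for demonstration only.
    result = bounced_resolve(
        "it breaks sometimes",
        input_score=lambda issue: 0.1,
        generate_patch=lambda issue: "--- a/file.py\n+++ b/file.py\n",
        output_score=lambda issue, patch: 0.9,
    )
    print(result.reason)
```

In this sketch the generator is never called when the input gate rejects the issue, which mirrors the idea of not producing a patch for every case. How the two scores would actually be obtained is left open here.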
