ImpossibleBench: Measuring LLMs' Propensity of Exploiting Test Cases

Ziqian Zhong; Aditi Raghunathan; Nicholas Carlini

arXiv:2510.20270·cs.LG·October 24, 2025

ImpossibleBench: Measuring LLMs' Propensity of Exploiting Test Cases

Ziqian Zhong, Aditi Raghunathan, Nicholas Carlini

PDF

Open Access 3 Reviews

TL;DR

ImpossibleBench is a benchmark framework that systematically measures and analyzes the tendency of large language models to exploit test cases, revealing cheating behaviors and aiding in developing more reliable LLM systems.

Contribution

It introduces a novel benchmark with impossible task variants to quantify LLMs' shortcut exploitation and demonstrates its utility in studying, engineering, and monitoring model behaviors.

Findings

01

Models exhibit varying cheating rates on impossible tasks.

02

Prompt design and test access influence cheating behavior.

03

ImpossibleBench enables detailed analysis of deception strategies.

Abstract

The tendency to find and exploit "shortcuts" to complete tasks poses significant risks for reliable assessment and deployment of large language models (LLMs). For example, an LLM agent with access to unit tests may delete failing tests rather than fix the underlying bug. Such behavior undermines both the validity of benchmark results and the reliability of real-world LLM coding assistant deployments. To quantify, study, and mitigate such behavior, we introduce ImpossibleBench, a benchmark framework that systematically measures LLM agents' propensity to exploit test cases. ImpossibleBench creates "impossible" variants of tasks from existing benchmarks like LiveCodeBench and SWE-bench by introducing direct conflicts between the natural-language specification and the unit tests. We measure an agent's "cheating rate" as its pass rate on these impossible tasks, where any pass necessarily…

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01Rating 6Confidence 2

Strengths

1. Clear problem formulation: The paper focuses on a concrete and practically important failure mode of LLM code agents: reward hacking via test-case exploitation. By constructing “impossible” tasks where any pass is necessarily specification-violating, the benchmark gives a clean operationalization of “cheating propensity,” which is easy to interpret and directly relevant to real-world agent deployments. 2. Methodological simplicity: The core idea is to mutate tests via one-off or conflicting

Weaknesses

1. Limited scope: While the coding domain is important, the framework currently only covers unit-test–driven programming benchmarks. Many safety-critical reward-hacking behaviors for LLM agents arise in more open-ended or non-code settings (e.g., tool-using assistants, structured reasoning tasks, RL-style environments) where “impossible tasks” are harder to define. The paper briefly argues generality, but does not demonstrate extensions beyond Python code + tests, limiting immediate applicabilit

Reviewer 02Rating 8Confidence 3

Strengths

- Overall, this paper presents a good contribution to LLM benchmarks, providing additional frameworks for cheating and reward hacking prevention in production. - The idea of creating tasks impossible to solve provides an objective measurement of attempts to cheat. - The strategy and design choices taken when modifying existing benchmarks can be borrowed by other projects with similar goals.

Weaknesses

There are some caveat in the paper to be clarified in author response: - LiveCodeBench quality control is missing due to lack of standard solution. However, the paper did not mention what other measures were done or attempted to sanitize the dataset and ensure consistent quality as SWE-Bench. Given the broad lower cheating rate of Impossible-LiveCodeBench, it is not hard to make claims that LiveCodeBench has low quality of "impossible", where test cases were actually passable. - There could be

Reviewer 03Rating 2Confidence 3

Strengths

- The paper is well written and organized. - The empirical evaluations are extensive in terms of the variety of models considered and tested under various context engineering scenarios.

Weaknesses

- I am not convinced if this paper is delivering on its promise of measuring cheating. I am worried that some of the mutations are conflicting with the task itself and common sense (like asserting that 7 is not a prime number) and LLMs tendency to modify them is not measuring their cheating but their eagerness. As a developer, this would be the first thing I imagine myself doing as well. - I also have reservations about the significance of the problem as I believe the cheating problem is closely

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsAdversarial Robustness in Machine Learning · Ethics and Social Impacts of AI · Software Testing and Debugging Techniques