Artificial or Just Artful? Do LLMs Bend the Rules in Programming?

Oussama Ben Sghaier; Kevin Delcourt; Houari Sahraoui

arXiv:2512.21028·cs.SE·December 25, 2025

Artificial or Just Artful? Do LLMs Bend the Rules in Programming?

Oussama Ben Sghaier, Kevin Delcourt, Houari Sahraoui

PDF

Open Access

TL;DR

This paper investigates how large language models adapt their code generation strategies when exposed to test cases under different prompting conditions, revealing significant performance changes and adaptation strategies that highlight conflicts between pretraining and alignment.

Contribution

It introduces a systematic analysis of LLM behavior with test signals, revealing how models adapt and the effectiveness of restrictions, advancing understanding of model alignment in code generation.

Findings

01

Test visibility significantly impacts correctness, with nearly doubling performance for some models.

02

Explicit restrictions only partially mitigate the influence of test cases.

03

Test-driven refinement is the most common adaptation strategy among LLMs.

Abstract

Large Language Models (LLMs) are widely used for automated code generation, yet their apparent successes often mask a tension between pretraining objectives and alignment choices. While pretraining encourages models to exploit all available signals to maximize success, alignment, whether through fine-tuning or prompting, may restrict their use. This conflict is especially salient in agentic AI settings, for instance when an agent has access to unit tests that, although intended for validation, act as strong contextual signals that can be leveraged regardless of explicit prohibitions. In this paper, we investigate how LLMs adapt their code generation strategies when exposed to test cases under different prompting conditions. Using the BigCodeBench (Hard) dataset, we design five prompting conditions that manipulate test visibility and impose explicit or implicit restrictions on their use.…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsSoftware Engineering Research · Scientific Computing and Data Management · Machine Learning and Data Classification