When Prompts Go Wrong: Evaluating Code Model Robustness to Ambiguous, Contradictory, and Incomplete Task Descriptions

Maya Larbi; Amal Akli; Mike Papadakis; Rihab Bouyousfi; Maxime Cordy; Federica Sarro; Yves Le Traon

arXiv:2507.20439·cs.SE·July 29, 2025

TL;DR

This study empirically evaluates how state-of-the-art code generation models perform when faced with ambiguous, incomplete, or contradictory task descriptions, revealing significant performance drops and error patterns.

Contribution

It introduces a new benchmark dataset with realistic task description flaws and systematically analyzes model robustness and failure modes across different model sizes and architectures.

Findings

01. Minor description flaws significantly reduce performance

02. Contradictory descriptions lead to logical errors

03. Larger models are more resilient but still vulnerable

Abstract

Large Language Models (LLMs) have demonstrated impressive performance in code generation tasks under idealized conditions, where task descriptions are clear and precise. However, in practice, task descriptions frequently exhibit ambiguity, incompleteness, or internal contradictions. In this paper, we present the first empirical study examining the robustness of state-of-the-art code generation models when faced with such unclear task descriptions. We extend the HumanEval and MBPP benchmarks by systematically introducing realistic task description flaws through guided mutation strategies, producing a dataset that mirrors the messiness of informal developer instructions. We evaluate multiple LLMs of varying sizes and architectures, analyzing their functional correctness and failure modes across task description categories. Our findings reveal that even minor imperfections in task…
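To make the comparison across task description categories concrete, here is a minimal Python sketch of how per-category functional correctness and its drop relative to the unmodified descriptions could be tallied. It is an illustration only, not the authors' evaluation harness: the function names `pass_rates` and `relative_drop`, the category labels, and the `toy_results` data are all assumptions made up for this example and do not reproduce any result from the paper.

```python
from collections import defaultdict


def pass_rates(results):
    """Map each description category to the fraction of tasks that passed.

    `results` is an iterable of (category, passed) pairs, where `passed`
    says whether the generated code passed the task's unit tests.
    """
    totals = defaultdict(int)
    passes = defaultdict(int)
    for category, passed in results:
        totals[category] += 1
        passes[category] += int(passed)
    return {category: passes[category] / totals[category] for category in totals}


def relative_drop(rates, baseline="original"):
    """Relative pass-rate drop of each category versus the baseline category."""
    base = rates[baseline]
    if base == 0:
        raise ValueError("baseline pass rate is zero; relative drop is undefined")
    return {
        category: (base - rate) / base
        for category, rate in rates.items()
        if category != baseline
    }


# Hypothetical toy outcomes (invented for illustration, NOT data from the paper).
toy_results = [
    ("original", True), ("original", True), ("original", True), ("original", False),
    ("ambiguous", True), ("ambiguous", True), ("ambiguous", False), ("ambiguous", False),
    ("incomplete", True), ("incomplete", False), ("incomplete", False), ("incomplete", False),
    ("contradictory", True), ("contradictory", False), ("contradictory", False), ("contradictory", False),
]

rates = pass_rates(toy_results)
drops = relative_drop(rates)

for category, drop in sorted(drops.items()):
    print(f"{category}: pass rate {rates[category]:.2f}, relative drop {drop:.0%}")
```

Reporting the drop relative to the original descriptions, rather than the raw pass rate alone, keeps models with different baseline accuracy comparable; whether the paper reports its results this way is not stated in the text above.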
