When Prompt Under-Specification Improves Code Correctness: An Exploratory Study of Prompt Wording and Structure Effects on LLM-Based Code Generation
Amal AKLI, Mike PAPADAKIS, Maxime CORDY, Yves Le TRAON

TL;DR
This study explores how prompt structure and richness affect the robustness of LLM-based code generation, finding that richer prompts can mitigate the effects of under-specification and that prompt mutations sometimes improve correctness.
Contribution
It demonstrates that prompt structure significantly influences LLM robustness, with richer prompts reducing sensitivity to under-specification and enabling correctness improvements.
Findings
Robustness varies with prompt structure and task complexity.
Structurally rich prompts mitigate under-specification effects.
Prompt mutations can disrupt misleading cues and improve correctness.
Abstract
Large language models are increasingly used for code generation, yet the correctness of their outputs depends not only on model capability but also on how tasks are specified. Prior studies demonstrate that small changes in natural language prompts, particularly under-specification, can substantially reduce code correctness; however, these findings are largely based on minimal-specification benchmarks such as HumanEval and MBPP, where limited structural redundancy may exaggerate sensitivity. In this exploratory study, we investigate how prompt structure, task complexity, and specification richness interact with LLM robustness to prompt mutations. We evaluate 10 different models across HumanEval and the structurally richer LiveCodeBench. Our results reveal that robustness is not a fixed property of LLMs but is highly dependent on prompt structure: the same under-specification mutations…
