LLM Performance for Code Generation on Noisy Tasks
Radzim Sendyka, Christian Cabrera, Andrei Paleyes, Diana Robinson, Neil Lawrence

TL;DR
This paper examines how large language models perform on highly obfuscated code generation tasks, revealing their reliance on memorization and highlighting challenges for benchmarking and safety evaluation.
Contribution
The paper introduces the concept of eager pattern matching to describe LLMs solving tasks that have been obfuscated beyond recognition, analyzes how performance decays under obfuscation, and discusses implications for benchmarking and safety in LLMs.
Findings
LLMs can solve tasks obfuscated to a level where the text would be unintelligible to human readers.
Performance decays differently on contaminated versus unseen datasets.
Obfuscation reveals reliance on memorization over reasoning.
Abstract
This paper investigates the ability of large language models (LLMs) to recognise and solve tasks which have been obfuscated beyond recognition. Focusing on competitive programming and benchmark tasks (LeetCode and MATH), we compare performance across multiple models and obfuscation methods, such as noise and redaction. We demonstrate that all evaluated LLMs can solve tasks obfuscated to a level where the text would be unintelligible to human readers, and does not contain key pieces of instruction or context. We introduce the concept of eager pattern matching to describe this behaviour, which is not observed in tasks published after the models' knowledge cutoff date, indicating strong memorisation or overfitting to training data, rather than legitimate reasoning about the presented problem. We report empirical evidence of distinct performance decay patterns between contaminated and…
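To make the two obfuscation families named in the abstract concrete, here is a minimal Python sketch of character-level noise and word-level redaction. This is an illustration only: the function names `add_noise` and `redact`, the `rate` parameter, and the specific replacement rules are assumptions made for this example, and the paper's actual obfuscation procedures may differ.

```python
import random
import string


def add_noise(text: str, rate: float, seed: int = 0) -> str:
    """Replace each non-whitespace character with a random lowercase
    letter with probability `rate` (hypothetical noise scheme)."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        if not ch.isspace() and rng.random() < rate:
            out.append(rng.choice(string.ascii_lowercase))
        else:
            out.append(ch)
    return "".join(out)


def redact(text: str, rate: float, seed: int = 0, mask: str = "_") -> str:
    """Mask each whitespace-separated word with probability `rate`
    (hypothetical redaction scheme). Whitespace is normalised to
    single spaces; masked words keep their original length."""
    rng = random.Random(seed)
    words = text.split()
    return " ".join(mask * len(w) if rng.random() < rate else w for w in words)


if __name__ == "__main__":
    task = "Given an array of integers, return indices of the two numbers that add up to a target."
    for rate in (0.0, 0.3, 0.6, 0.9):
        print(f"noise  {rate:.1f}: {add_noise(task, rate, seed=1)}")
        print(f"redact {rate:.1f}: {redact(task, rate, seed=1)}")
```

Sweeping `rate` from 0 to 1 and recording the solve rate at each level would give one way to trace a performance decay curve, which could then be compared between tasks published before and after a model's knowledge cutoff.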
Taxonomy
Topics: Software Engineering Research · Topic Modeling · Adversarial Robustness in Machine Learning
