LLMs taking shortcuts in test generation: A study with SAP HANA and LevelDB
Vekil Bekmyradov, Noah C. Pütz, and Thomas Bartz-Beielstein

TL;DR
This study examines how large language models generate software tests. It finds that they often rely on shortcuts and perform poorly on complex, unseen systems such as SAP HANA, which underscores the need for better evaluation methods.
Contribution
The paper introduces a combined cognitive and empirical evaluation framework for assessing LLM reasoning in software test generation across open-source and proprietary systems.
Findings
LLMs perform well on familiar open-source benchmarks.
LLMs struggle with complex, unseen domains like SAP HANA.
Current LLMs tend to prioritize compilability over semantic correctness.
Abstract
Large Language Models (LLMs) have achieved impressive results on public benchmarks, often leading to claims of advanced reasoning and understanding. However, recent research in cognitive science reveals that these models sometimes rely on shallow heuristics and memorization, taking shortcuts rather than demonstrating genuine cognitive abilities. This paper investigates LLM behavior in automated test generation for software, contrasting performance on an open-source system (LevelDB) with SAP HANA, one of the most widely deployed commercial database systems worldwide, whose proprietary codebase is guaranteed to be absent from training data. We combine cognitive evaluation principles, drawing on Mitchell's mechanism-focused assessment methodology, with empirical software testing, employing mutation score and iterative compiler-feedback repair loops to assess both accuracy and underlying…
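The abstract names mutation score as one of its evaluation metrics. As a rough illustration only, the sketch below computes the score under the conventional definition: the fraction of non-equivalent mutants that at least one test detects ("kills"). The `MutantResult` type, its field names, and the choice to exclude equivalent mutants are assumptions made for this example; the summary above does not state how the authors compute the metric.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class MutantResult:
    """Outcome of running a test suite against one mutant (hypothetical type)."""
    mutant_id: str
    killed: bool              # True if at least one test failed on this mutant
    equivalent: bool = False  # True if the mutant behaves identically to the original


def mutation_score(results: Iterable[MutantResult]) -> float:
    """Return killed / (total - equivalent), the conventional mutation score."""
    # Assumption: equivalent mutants cannot be killed, so they are excluded.
    considered = [r for r in results if not r.equivalent]
    if not considered:
        raise ValueError("no non-equivalent mutants to score")
    killed = sum(1 for r in considered if r.killed)
    return killed / len(considered)


# Illustrative data only, not results from the paper.
results = [
    MutantResult("m1", killed=True),
    MutantResult("m2", killed=False),
    MutantResult("m3", killed=True),
    MutantResult("m4", killed=False, equivalent=True),
]
score = mutation_score(results)
```

A test suite that compiles but asserts little will leave most mutants alive, so a low score under this definition is one way to tell compilable tests apart from semantically meaningful ones.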
