Benchmarking Large Language Models for ABAP Code Generation: An Empirical Study on Iterative Improvement by Compiler Feedback
Stephan Wallraven, Tim Köhne, Hartmut Westenberger, Andreas Moser

TL;DR
This empirical study evaluates the ability of various Large Language Models to generate correct ABAP code, emphasizing the importance of compiler feedback for iterative improvement and highlighting the performance differences among models.
Contribution
The paper provides a systematic benchmark of LLMs for ABAP code generation, a language for which hardly any such analyses exist so far, and analyzes both the models' effectiveness and the impact of compiler feedback in an iterative process.
Findings
More powerful LLMs reach success rates of around 75% after several iterations
Compiler feedback substantially improves code correctness for these models
Smaller models perform significantly worse
Abstract
This work investigates the performance of Large Language Models (LLMs) in generating ABAP code. Despite successful applications of generative AI in many programming languages, there are hardly any systematic analyses of ABAP code generation to date. The aim of the study is to empirically analyze to what extent various LLMs can generate syntactically correct and functional ABAP code, how effectively they use compiler feedback for iterative improvement, and which task types pose special challenges. For this purpose, a benchmark with 180 tasks is conducted, consisting of adapted HumanEval tasks and practical SAP scenarios. The results show significant performance differences between the models: more powerful LLMs achieve success rates of around 75% after several iterations and benefit greatly from compiler feedback, while smaller models perform significantly worse. Overall, the study…
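The iterative process described in the abstract can be sketched roughly as follows. This is a minimal illustration in Python, not the authors' benchmark harness: the function names `refine_with_feedback` and `cumulative_success`, the `generate` and `check` callables, and the round limit are all hypothetical stand-ins. In a real setup `generate` would call an LLM and `check` would run the ABAP syntax check and tests; here both are toy stubs so the sketch runs on its own, and the numbers in the usage part are invented sample data, not results from the paper.

```python
from typing import Callable, List, Optional, Tuple


def refine_with_feedback(
    task: str,
    generate: Callable[[str, Optional[str], Optional[str]], str],
    check: Callable[[str], List[str]],
    max_rounds: int = 5,  # assumed limit, not taken from the paper
) -> Tuple[Optional[str], int]:
    """Generate code, check it, and feed error messages back until it passes.

    Returns (code, rounds_used) on success or (None, max_rounds) on failure.
    """
    code: Optional[str] = None
    feedback: Optional[str] = None
    for round_no in range(1, max_rounds + 1):
        # The generator sees the task, its previous attempt, and the errors.
        code = generate(task, code, feedback)
        errors = check(code)
        if not errors:
            return code, round_no
        feedback = "\n".join(errors)
    return None, max_rounds


def cumulative_success(
    rounds_needed: List[Optional[int]], max_rounds: int
) -> List[float]:
    """Share of tasks solved within k rounds, for k = 1..max_rounds.

    Each entry of rounds_needed is the round in which a task first passed,
    or None if it never passed.
    """
    total = len(rounds_needed)
    return [
        sum(1 for r in rounds_needed if r is not None and r <= k) / total
        for k in range(1, max_rounds + 1)
    ]


# Toy stand-ins: the first attempt fails, the attempt after feedback passes.
def fake_generate(task: str, previous: Optional[str], feedback: Optional[str]) -> str:
    return "attempt_1" if feedback is None else "attempt_2"


def fake_check(code: str) -> List[str]:
    return ["syntax error (toy message)"] if code == "attempt_1" else []


result = refine_with_feedback("toy task", fake_generate, fake_check)
rates = cumulative_success([1, 3, None, 2, 1], max_rounds=3)
```

With the stubs above, `result` is `("attempt_2", 2)`: the second attempt passes once the first round's error text has been fed back. Tracking the round in which each task first passes makes it possible to report a success rate per iteration rather than a single final number, which is the kind of comparison the study's findings rely on.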
Taxonomy
Topics: Machine Learning and Data Classification · Software Engineering Research · Natural Language Processing Techniques
