Benchmarking Large Language Models for ABAP Code Generation: An Empirical Study on Iterative Improvement by Compiler Feedback

Stephan Wallraven; Tim Köhne; Hartmut Westenberger; Andreas Moser

arXiv:2601.15188 · cs.SE · January 22, 2026

TL;DR

This empirical study evaluates the ability of various Large Language Models to generate correct ABAP code, emphasizing the importance of compiler feedback for iterative improvement and highlighting the performance differences among models.

Contribution

The paper provides the first systematic benchmark of LLMs for ABAP code generation, analyzing their effectiveness and the impact of compiler feedback in an iterative process.

Findings

01 More powerful LLMs reach success rates of around 75% after several iterations

02 Compiler feedback significantly improves code correctness

03 Smaller models perform substantially worse

Abstract

This work investigates the performance of Large Language Models (LLMs) in generating ABAP code. Despite successful applications of generative AI in many programming languages, there are hardly any systematic analyses of ABAP code generation to date. The aim of the study is to empirically analyze to what extent various LLMs can generate syntactically correct and functional ABAP code, how effectively they use compiler feedback for iterative improvement, and which task types pose special challenges. For this purpose, a benchmark of 180 tasks is conducted, consisting of adapted HumanEval tasks and practical SAP scenarios. The results show significant performance differences between the models: more powerful LLMs achieve success rates of around 75% after several iterations and benefit greatly from compiler feedback, while smaller models perform significantly worse. Overall, the study…
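The iterative improvement by compiler feedback described in the abstract can be pictured as a generate, compile, retry loop. The following is only a minimal sketch under stated assumptions, not the authors' harness: `refine_with_feedback`, `CompileResult`, `toy_generate`, `toy_compile`, and the `max_iterations` limit are all hypothetical names and values introduced here for illustration. The toy "compiler" merely checks that a statement ends with a period; it stands in for a real ABAP syntax check and does not test functional correctness, which the study also evaluates.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class CompileResult:
    ok: bool
    errors: List[str]


def refine_with_feedback(
    task: str,
    generate: Callable[[str, Optional[str], List[str]], str],
    compile_code: Callable[[str], CompileResult],
    max_iterations: int = 5,  # assumed limit; the abstract only says "several iterations"
) -> Tuple[Optional[str], int, bool]:
    """Ask the model for code, compile it, and feed errors back until it compiles.

    Returns (last_code, iterations_used, success).
    """
    code: Optional[str] = None
    errors: List[str] = []
    for iteration in range(1, max_iterations + 1):
        # The generator sees the task, its previous attempt, and the compiler errors.
        code = generate(task, code, errors)
        result = compile_code(code)
        if result.ok:
            return code, iteration, True
        errors = result.errors
    return code, max_iterations, False


# --- Hypothetical stand-ins so the sketch runs without an LLM or an SAP system ---

def toy_generate(task: str, previous: Optional[str], errors: List[str]) -> str:
    """Pretend model: forgets the terminating period until it is told about it."""
    if previous is not None and errors:
        return previous + "."
    return "WRITE 'hello'"


def toy_compile(code: str) -> CompileResult:
    """Pretend compiler: accepts only code that ends with a period."""
    if code.rstrip().endswith("."):
        return CompileResult(ok=True, errors=[])
    return CompileResult(ok=False, errors=["statement is not terminated by a period"])


code, used, success = refine_with_feedback("print hello", toy_generate, toy_compile)
print(code, used, success)
```

In this toy run the first attempt fails, the error message is passed back, and the second attempt compiles. In a benchmark setting, the fraction of tasks for which such a loop ends successfully after each iteration would give a per-iteration success rate comparable in spirit to the figures reported above.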

Taxonomy

Topics: Machine Learning and Data Classification · Software Engineering Research · Natural Language Processing Techniques