Towards Comprehensive Benchmarking Infrastructure for LLMs In Software Engineering

Daniel Rodriguez-Cardenas; Xiaochang Li; Marcos Macedo; Antonio Mastropaolo; Dipin Khati; Yuan Tian; Huajie Shao; Denys Poshyvanyk

arXiv:2601.21070·cs.SE·January 30, 2026

Open Access

TL;DR

This paper introduces BEHELM, a comprehensive benchmarking infrastructure designed to evaluate large language models in software engineering across multiple tasks and quality dimensions, addressing current evaluation gaps.

Contribution

The paper presents BEHELM, a holistic, standardized benchmarking framework that unifies software-scenario specification with multi-metric evaluation for LLMs in software engineering.
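As a rough illustration of what pairing scenario specification with multi-metric evaluation could look like, the sketch below scores one toy model on two scenarios with two metrics each. It is not BEHELM's actual interface: every name here (`Scenario`, `evaluate`, `exact_match`, `length_ratio`, the toy model, and the task names) is hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical types for illustration only; none of these names come from BEHELM.
Metric = Callable[[str, str], float]  # (prediction, reference) -> score


@dataclass
class Scenario:
    """A software-engineering scenario: a task name plus (input, reference) pairs."""
    task: str
    examples: List[Tuple[str, str]]


def exact_match(prediction: str, reference: str) -> float:
    """Return 1.0 if prediction and reference are equal after trimming whitespace."""
    return 1.0 if prediction.strip() == reference.strip() else 0.0


def length_ratio(prediction: str, reference: str) -> float:
    """Shorter-over-longer length ratio; a crude stand-in for a second quality dimension."""
    longer = max(len(prediction), len(reference))
    return 1.0 if longer == 0 else min(len(prediction), len(reference)) / longer


def evaluate(
    model: Callable[[str], str],
    scenarios: List[Scenario],
    metrics: Dict[str, Metric],
) -> Dict[str, Dict[str, float]]:
    """Average every metric over every scenario (assumes each scenario is non-empty)."""
    report: Dict[str, Dict[str, float]] = {}
    for scenario in scenarios:
        predictions = [model(inp) for inp, _ in scenario.examples]
        references = [ref for _, ref in scenario.examples]
        report[scenario.task] = {
            name: sum(metric(p, r) for p, r in zip(predictions, references)) / len(references)
            for name, metric in metrics.items()
        }
    return report


# Toy usage: a "model" that simply echoes its prompt.
scenarios = [
    Scenario("code_completion", [("return a +", "return a + b")]),
    Scenario("commit_message", [("fix typo", "fix typo")]),
]
report = evaluate(
    lambda prompt: prompt,
    scenarios,
    {"exact_match": exact_match, "length_ratio": length_ratio},
)
print(report)
```

The point of the sketch is the shape of the result: one row per scenario and one column per metric, so a model that looks strong on a single score can still be inspected along other dimensions.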

Findings

01

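The survey and community workshop identified three core barriers to reliable evaluation: the absence of software-engineering-rich datasets, overreliance on ML-centric metrics, and the lack of standardized, reproducible data pipelines.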

02

BEHELM enables multi-task, multi-metric assessment across diverse software engineering scenarios.

03

BEHELM aims to facilitate fair, realistic, and scalable evaluation of LLMs in software engineering.

Abstract

Large language models for code are advancing fast, yet our ability to evaluate them lags behind. Current benchmarks focus on narrow tasks and single metrics, which hide critical gaps in robustness, interpretability, fairness, efficiency, and real-world usability. They also suffer from inconsistent data engineering practices, limited software engineering context, and widespread contamination issues. To understand these problems and chart a path forward, we combined an in-depth survey of existing benchmarks with insights gathered from a dedicated community workshop. We identified three core barriers to reliable evaluation: the absence of software-engineering-rich datasets, overreliance on ML-centric metrics, and the lack of standardized, reproducible data pipelines. Building on these findings, we introduce BEHELM, a holistic benchmarking infrastructure that unifies software-scenario…

Taxonomy

Topics: Software Engineering Research · Scientific Computing and Data Management · Machine Learning in Materials Science