TL;DR
This study systematically evaluates 31 LLM safety benchmarks, revealing significant deficiencies in code quality, ethical considerations, and reproducibility, which impact safety assessments and community adoption.
Contribution
It provides the first comprehensive analysis of benchmark code quality, ethical practices, and adoption factors, offering actionable recommendations for improvement.
Findings
Only 39% of repositories can run without modification.
Few benchmarks include ethical considerations despite harmful content.
Benchmark adoption correlates with author prominence and runnability.
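As a rough sanity check on the headline figures, the reported percentages (39% runnable, 16% with flawless installation guides, 6% with ethical considerations) can be converted into approximate repository counts. The sketch below assumes every percentage uses all 31 benchmarks as its denominator, which this summary does not confirm; the function names are illustrative only.

```python
# Sketch: convert the reported percentages into approximate repository counts.
# ASSUMPTION: each percentage is computed over all 31 benchmarks.
# The summary does not state the denominator explicitly.

TOTAL_BENCHMARKS = 31

# Percentages as reported for the study (rounded to whole numbers).
REPORTED = {
    "runs without modification": 39,
    "flawless installation guide": 16,
    "ethical considerations": 6,
}


def implied_count(percent: int, total: int) -> int:
    """Nearest whole number of items that a rounded percentage implies."""
    return round(percent / 100 * total)


def is_consistent(percent: int, total: int) -> bool:
    """True if the implied count rounds back to the reported percentage."""
    count = implied_count(percent, total)
    return round(100 * count / total) == percent


if __name__ == "__main__":
    for label, percent in REPORTED.items():
        count = implied_count(percent, TOTAL_BENCHMARKS)
        ok = is_consistent(percent, TOTAL_BENCHMARKS)
        print(f"{label}: {percent}% ~ {count} of {TOTAL_BENCHMARKS} (consistent: {ok})")
```

Under that assumption the figures correspond to roughly 12, 5, and 2 repositories respectively, so the smaller percentages rest on very few repositories.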
Abstract
The rapid expansion of research in LLM safety presents challenges in tracking advancements, making benchmarks important evaluation infrastructures for identifying key trends and facilitating systematic comparisons. Yet no systematic assessment exists of their code quality and runnability, nor of what factors are associated with the community's adoption of certain benchmarks over others. To address this gap, we conduct a systematic measurement study of 31 LLM safety benchmarks (covering prompt injection, jailbreak, and hallucination) with 382 non-benchmark papers as a control group, combining automated static analysis, human runnability testing (220+ person-hours), and bibliometric analysis. We find that only 39% of benchmark repositories can run without modification, only 16% provide flawless installation guides, and a mere 6% include ethical considerations despite containing…
Peer Reviews
Decision·Submitted to ICLR 2026
* The paper is well written and well structured.
* The methodology is rigorous, and I trust the conclusions.
* The conclusions are interesting and, I would guess, probably valid in general and not only for LLM safety benchmarks, although no one who has ever looked at research code will be surprised to learn that it is often not of high quality (the incentives are not there). Everyone who publishes a benchmark should take note that making it easy to run increases its scientific impact.
I do not find many weaknesses in this study. On the contrary, I find it very rigorous and trustworthy. My main concern is whether the topic is too narrow, as it is a benchmark of LLM safety benchmarks. It is a bit on the side of representation learning, so I am not sure the community would value this study despite its many strong qualities. My strong belief is that meta-studies of this type, which inform us, the AI community, about what constitutes good research and what influences impact, are importa
Strengths - Timely topic focusing on LLM safety benchmarks - Broader impact on the research community as well as industry practitioners who deploy or develop LLMs
Weakness - Insights from security researchers would be valuable
The paper presents several interesting and valuable findings. Notably, some of the ethical and reproducibility-related metrics—such as only 39% of repositories being ready-to-use, 16% including flawless installation guides, and a mere 6% addressing ethical considerations—highlight the need for researchers to pay more attention to open-sourcing and maintaining their code alongside their research contributions. Additionally, the observation that author's h-index does not show a strong correlat
Much of the experimental design seems to rely on somewhat imprecise metrics and a fair amount of manual inspection, which makes the paper’s motivation a bit hard to follow. Additionally, the conclusions, experimental design, and motivation appear somewhat subjective, reflecting the authors’ own perspective rather than broader community evidence. It might be more appropriate to frame this part as an initial motivation supported by larger-scale community surveys rather than just the authors’ judg
Taxonomy
Topics: Software Engineering Research · Scientific Computing and Data Management · Academic integrity and plagiarism
