Benchmark of Benchmarks: Unpacking Influence and Code Repository Quality in LLM Safety Benchmarks

Junjie Chu; Xinyue Shen; Ye Leng; Michael Backes; Yun Shen; Yang Zhang

arXiv:2603.04459·cs.CR·May 18, 2026

TL;DR

This study systematically evaluates 31 LLM safety benchmarks, revealing significant deficiencies in code quality, ethical considerations, and reproducibility, which impact safety assessments and community adoption.

Contribution

It provides the first comprehensive analysis of benchmark code quality, ethical practices, and adoption factors, offering actionable recommendations for improvement.

Findings

01

Only 39% of repositories can run without modification.

02

Few benchmarks include ethical considerations despite harmful content.

03

Benchmark adoption correlates with author prominence and runnability.

Abstract

The rapid expansion of research in LLM safety presents challenges in tracking advancements, making benchmarks important evaluation infrastructures for identifying key trends and facilitating systematic comparisons. Yet no systematic assessment exists of their code quality and runnability, nor of what factors are associated with the community's adoption of certain benchmarks over others. To address this gap, we conduct a systematic measurement study of 31 LLM safety benchmarks (covering prompt injection, jailbreak, and hallucination) with 382 non-benchmark papers as a control group, combining automated static analysis, human runnability testing (220+ person-hours), and bibliometric analysis. We find that only 39% of benchmark repositories can run without modification, only 16% provide flawless installation guides, and a mere 6% include ethical considerations despite containing…
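The percentages in the abstract can be translated into approximate repository counts. The sketch below is an illustration only: it assumes each percentage is taken over all 31 benchmarks and rounded to the nearest whole percent, which the source does not state explicitly. The names `reported_percentages` and `approximate_count` are hypothetical, and the resulting counts are inferred, not reported by the authors.

```python
# Figures quoted in the abstract. The derived counts are an inference,
# assuming each percentage is computed over all 31 benchmarks.
TOTAL_BENCHMARKS = 31

reported_percentages = {
    "runs without modification": 39,
    "flawless installation guide": 16,
    "includes ethical considerations": 6,
}


def approximate_count(percent: int, total: int) -> int:
    """Return the nearest whole number of repositories for a rounded percentage."""
    return round(percent / 100 * total)


approx_counts = {
    label: approximate_count(percent, TOTAL_BENCHMARKS)
    for label, percent in reported_percentages.items()
}

for label, count in approx_counts.items():
    # Round-trip check: does the inferred count reproduce the quoted percentage?
    round_trip = round(100 * count / TOTAL_BENCHMARKS)
    print(f"{label}: ~{count} of {TOTAL_BENCHMARKS} (round-trips to {round_trip}%)")
```

Under these assumptions the inferred counts round back to the quoted percentages, so the reading is at least internally consistent, but the paper itself remains the authority on the exact numbers.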

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 8 · Confidence 4

Strengths

* The paper is well written and well structured.
* The methodology is rigorous, and I trust the conclusions.
* The conclusions are interesting and, I would guess, probably valid in general and not only for LLM safety benchmarks, although no one who has ever looked at research code will be surprised to learn that it is often not of high quality (the incentives are not there). Everyone who publishes a benchmark should take note that making it easy to run increases its scientific impact.

Weaknesses

I do not find many weaknesses in this study. On the contrary, I find it very rigorous and trustworthy. My main concern is whether the topic is too narrow, as it is a benchmark of LLM safety benchmarks. It is a bit on the side of representation learning, so I am not sure the community would value this study despite its many strong qualities. My strong belief is that meta-studies of this type, which inform us, the AI community, what constitutes good research and what influences impact, are importa

Reviewer 02 · Rating 6 · Confidence 5

Strengths

- Timely topic focusing on the safety of LLM benchmarks.
- Broader impact on the research community as well as industry practitioners who deploy or develop LLMs.

Weaknesses

- Insights from security researchers would be valuable.

Reviewer 03 · Rating 2 · Confidence 4

Strengths

The paper presents several interesting and valuable findings. Notably, some of the ethical and reproducibility-related metrics—such as only 39% of repositories being ready-to-use, 16% including flawless installation guides, and a mere 6% addressing ethical considerations—highlight the need for researchers to pay more attention to open-sourcing and maintaining their code alongside their research contributions. Additionally, the observation that author's h-index does not show a strong correlat

Weaknesses

Much of the experimental design seems to rely on somewhat imprecise metrics and a fair amount of manual inspection, which makes the paper’s motivation a bit hard to follow. Additionally, the conclusions, experimental design, and motivation appear somewhat subjective, reflecting the authors’ own perspective rather than broader community evidence. It might be more appropriate to frame this part as an initial motivation supported by larger-scale community surveys rather than just the authors’ judg

Taxonomy

Topics: Software Engineering Research · Scientific Computing and Data Management · Academic integrity and plagiarism