AISafetyBenchExplorer: A Metric-Aware Catalogue of AI Safety Benchmarks Reveals Fragmented Measurement and Weak Benchmark Governance
Abiodun A. Solanke

TL;DR
AISafetyBenchExplorer is a catalogue of 195 AI safety benchmarks that exposes fragmented measurement, a lack of standardization, and the need for stronger benchmark governance.
Contribution
The paper introduces AISafetyBenchExplorer, a structured, meta-analytical catalogue that organizes and analyzes AI safety benchmarks to improve measurement coherence.
Findings
Benchmark proliferation exceeds measurement standardization.
Medium-complexity benchmarks dominate the catalogue (94 of 195), and most benchmarks are English-only.
Many repositories and datasets are stale or inconsistently maintained.
Abstract
The rapid expansion of large language model (LLM) safety evaluation has produced a substantial benchmark ecosystem, but not a correspondingly coherent measurement ecosystem. We present AISafetyBenchExplorer, a structured catalogue of 195 AI safety benchmarks released between 2018 and 2026, organized through a multi-sheet schema that records benchmark-level metadata, metric-level definitions, benchmark-paper metadata, and repository activity. This design enables meta-analysis not only of what benchmarks exist, but also of how safety is operationalized, aggregated, and judged across the literature. Using the updated catalogue, we identify a central structural problem: benchmark proliferation has outpaced measurement standardization. The current landscape is dominated by medium-complexity benchmarks (94/195), while only 7 benchmarks occupy the Popular tier. The workbook further reports…
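As a rough illustration of what a benchmark-level sheet linked to metric-level definitions could look like, the following is a minimal sketch. It is not the catalogue's actual schema: the class names, field names, and the toy records are all hypothetical, and only the counts 94, 7, and 195 come from the abstract.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List


@dataclass
class MetricRecord:
    # Hypothetical metric-level row: one metric definition per benchmark.
    name: str
    definition: str


@dataclass
class BenchmarkRecord:
    # Hypothetical benchmark-level row; field names are assumptions.
    name: str
    complexity: str          # e.g. "medium"
    tier: str                # e.g. "Popular"
    languages: List[str]
    metrics: List[MetricRecord] = field(default_factory=list)


def share(count: int, total: int) -> float:
    """Return count as a fraction of total."""
    if total <= 0:
        raise ValueError("total must be positive")
    return count / total


def complexity_counts(records: List[BenchmarkRecord]) -> Counter:
    """Count benchmarks per complexity label."""
    return Counter(r.complexity for r in records)


# Toy records, invented purely for illustration.
toy = [
    BenchmarkRecord("bench-a", "medium", "Popular", ["en"],
                    [MetricRecord("refusal_rate", "fraction of prompts refused")]),
    BenchmarkRecord("bench-b", "medium", "Other", ["en"]),
    BenchmarkRecord("bench-c", "high", "Other", ["en", "zh"]),
]

print(complexity_counts(toy))
# Shares computed from the figures reported in the abstract.
print(round(share(94, 195), 3))
print(round(share(7, 195), 3))
```

Keeping metric definitions as separate records attached to each benchmark is one way to compare how differently named metrics are operationalized across benchmarks, which is the kind of analysis the abstract describes.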
