When No Benchmark Exists: Validating Comparative LLM Safety Scoring Without Ground-Truth Labels

Sushant Gautam, Finn Schwall, Annika Willoch Olstad, Fernando Vallecillos Ruiz, Birk Torpmann-Hagen, Sunniva Maria Stordal Bjørklund, Leon Moonen, Klas Pettersen, and Michael A. Riegler

arXiv:2605.06652·cs.LG·May 8, 2026

TL;DR

This paper introduces a method for comparing the safety of language models without ground-truth labels. It uses scenario-based audits validated through an instrumental-validity chain and demonstrates the approach on a Norwegian safety pack.

Contribution

The paper formalizes benchmarkless comparative safety scoring, proposes a validation framework for it, and instantiates that framework with the SimpleAudit and Petri tools for practical deployment.

Findings

01 Safe and abliterated targets separate with AUROC 0.89-1.00.

02 Target identity accounts for about 52% of variance.

03 Severity profiles stabilize after ten reruns.
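To make the variance finding concrete, here is a minimal sketch of one common way to compute the share of score variance attributable to target identity: the ratio of between-target to total sum of squares (eta-squared). This is an illustrative assumption, not necessarily the estimator the paper uses. The function name `target_variance_share`, the record layout, and the toy scores are all hypothetical.

```python
from collections import defaultdict
from statistics import fmean


def target_variance_share(records):
    """Return SS_between / SS_total for (target_name, score) records.

    Assumption: a plain one-way eta-squared, ignoring auditor and judge
    factors. The paper's actual decomposition may differ.
    """
    groups = defaultdict(list)
    for target, score in records:
        groups[target].append(float(score))

    all_scores = [s for scores in groups.values() for s in scores]
    grand_mean = fmean(all_scores)

    # Total spread of every score around the grand mean.
    ss_total = sum((s - grand_mean) ** 2 for s in all_scores)
    if ss_total == 0:
        return 0.0  # All scores identical: nothing to attribute.

    # Spread of the per-target means, weighted by group size.
    ss_between = sum(
        len(scores) * (fmean(scores) - grand_mean) ** 2
        for scores in groups.values()
    )
    return ss_between / ss_total


# Toy data (hypothetical): two targets, two audit runs each.
toy_records = [
    ("model_a", 1.0),
    ("model_a", 2.0),
    ("model_b", 3.0),
    ("model_b", 4.0),
]
share = target_variance_share(toy_records)
print(f"target share of variance: {share:.2f}")
```

A share near 1 would mean scores are driven mostly by which model is audited. A share near 0 would mean other sources, such as auditor, judge, or rerun noise, dominate.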

Abstract

Many deployments must compare candidate language models for safety before a labeled benchmark exists for the relevant language, sector, or regulatory regime. We formalize this setting as benchmarkless comparative safety scoring and specify the contract under which a scenario-based audit can be interpreted as deployment evidence. Scores are valid only under a fixed scenario pack, rubric, auditor, judge, sampling configuration, and rerun budget. Because no labels are available, we replace ground-truth agreement with an instrumental-validity chain: responsiveness to a controlled safe-versus-abliterated contrast, dominance of target-driven variance over auditor and judge artifacts, and stability across reruns. We instantiate the chain in SimpleAudit, a local-first scoring instrument, and validate it on a Norwegian safety pack. Safe and abliterated targets separate with AUROC values…
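The first link of the chain, responsiveness to a safe-versus-abliterated contrast, can be illustrated with a small AUROC calculation. The sketch below assumes each audit run produces one numeric severity score per target and that higher means more unsafe; both are assumptions made for illustration. The function `auroc` and the toy scores are hypothetical and are not taken from SimpleAudit.

```python
def auroc(safe_scores, abliterated_scores):
    """Probability that a random abliterated run outscores a random safe run.

    Assumption: higher score = more severe. Ties count as half a win,
    which matches the Mann-Whitney U formulation of AUROC.
    """
    if not safe_scores or not abliterated_scores:
        raise ValueError("both score lists must be non-empty")

    wins = 0.0
    for abl in abliterated_scores:
        for safe in safe_scores:
            if abl > safe:
                wins += 1.0
            elif abl == safe:
                wins += 0.5
    return wins / (len(safe_scores) * len(abliterated_scores))


# Toy severity scores (hypothetical), one per audit run.
safe_runs = [0.1, 0.2, 0.2, 0.4]
abliterated_runs = [0.3, 0.6, 0.7, 0.2]

separation = auroc(safe_runs, abliterated_runs)
print(f"AUROC: {separation:.4f}")
```

An AUROC of 0.5 would mean the instrument cannot tell the two targets apart, while 1.0 would mean every abliterated run scores worse than every safe run.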

Code & Models

Repositories

kelkalot/simpleaudit (GitHub)