When No Benchmark Exists: Validating Comparative LLM Safety Scoring Without Ground-Truth Labels
Sushant Gautam, Finn Schwall, Annika Willoch Olstad, Fernando Vallecillos Ruiz, Birk Torpmann-Hagen, Sunniva Maria Stordal Bjørklund, Leon Moonen, Klas Pettersen, and Michael A. Riegler

TL;DR
This paper introduces a method for comparing language model safety without ground-truth labels, using scenario-based audits validated through an instrumental-validity chain, demonstrated on Norwegian safety data.
Contribution
It formalizes benchmarkless safety scoring, proposing a validation framework and instantiating it with SimpleAudit and Petri tools for practical deployment.
Findings
Safe and abliterated targets separate with AUROC 0.89-1.00
Target identity accounts for about 52% of variance
Severity profiles stabilize after ten reruns
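To make the "share of variance" finding concrete, here is a minimal sketch of a one-way variance share (between-target sum of squares divided by total sum of squares). It is an illustration under simplifying assumptions, not the paper's decomposition, which also accounts for auditor and judge effects. The function name `target_variance_share`, the target names, and the scores are hypothetical.

```python
from collections import defaultdict


def target_variance_share(records):
    """Return the fraction of score variance explained by target identity.

    records: list of (target_name, score) pairs, one per audit run.
    Computes between-target sum of squares / total sum of squares.
    """
    scores = [score for _, score in records]
    grand_mean = sum(scores) / len(scores)
    total_ss = sum((s - grand_mean) ** 2 for s in scores)
    if total_ss == 0:
        return 0.0  # no variance at all, so nothing to attribute

    by_target = defaultdict(list)
    for target, score in records:
        by_target[target].append(score)

    between_ss = sum(
        len(vals) * (sum(vals) / len(vals) - grand_mean) ** 2
        for vals in by_target.values()
    )
    return between_ss / total_ss


# Hypothetical severity scores for two targets, two runs each.
records = [
    ("model_a", 1.0), ("model_a", 2.0),
    ("model_b", 3.0), ("model_b", 4.0),
]
share = target_variance_share(records)
print(f"target share of variance: {share:.2f}")
```

A share near 1 would mean scores are driven mostly by which target is audited; a share near 0 would mean run-to-run noise or other factors dominate.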
Abstract
Many deployments must compare candidate language models for safety before a labeled benchmark exists for the relevant language, sector, or regulatory regime. We formalize this setting as benchmarkless comparative safety scoring and specify the contract under which a scenario-based audit can be interpreted as deployment evidence. Scores are valid only under a fixed scenario pack, rubric, auditor, judge, sampling configuration, and rerun budget. Because no labels are available, we replace ground-truth agreement with an instrumental-validity chain: responsiveness to a controlled safe-versus-abliterated contrast, dominance of target-driven variance over auditor and judge artifacts, and stability across reruns. We instantiate the chain in SimpleAudit, a local-first scoring instrument, and validate it on a Norwegian safety pack. Safe and abliterated targets separate with AUROC values…
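The responsiveness link of the chain can be illustrated with a small AUROC computation: the probability that a randomly chosen abliterated-target score exceeds a randomly chosen safe-target score, with ties counted as one half. This is a sketch under assumptions, not SimpleAudit's implementation; the function name `auroc` and the severity scores are hypothetical, and higher scores are assumed to mean less safe behavior.

```python
def auroc(pos_scores, neg_scores):
    """Pairwise AUROC: P(positive > negative), ties counted as 0.5."""
    if not pos_scores or not neg_scores:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical per-scenario severity scores (higher = less safe).
abliterated = [3.0, 4.0, 2.0, 4.0]
safe = [0.0, 1.0, 2.0, 0.0]

separation = auroc(abliterated, safe)
print(f"AUROC: {separation:.3f}")
```

An AUROC of 0.5 would indicate that the instrument cannot tell the two targets apart, while values near 1.0 indicate that the controlled contrast is reliably detected.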
