Evaluation without Generation: Non-Generative Assessment of Harmful Model Specialization with Applications to CSAM
Vinith M. Suriyakumar, Ayush Sekhari, Lena Stempfle, Robertson Wang, Michael Simpson, Rebecca Portnoff, Marzyeh Ghassemi, Ashia C. Wilson

TL;DR
This paper introduces Gaussian probing, a non-generative method that evaluates harmful model specialization by analyzing internal representations rather than outputs. It is especially useful in sensitive domains such as CSAM, where output-based evaluation is legally constrained.
Contribution
The paper presents Gaussian probing as a novel, scalable approach to assess model harm without generating outputs, effective in high-risk, legally constrained domains.
Findings
Gaussian probing reliably detects harmful model specialization.
The method distinguishes benign from harmful models without output sampling.
It remains robust to adversarial weight-rescaling manipulations.
Abstract
Auditing the fine-tunes of open-weight generative models for harmful specialization has become a new governance challenge for model hosting platforms. The standard toolkit, generative evaluation via curated prompts or red-teaming, does not scale to platform-level auditing and breaks down entirely for domains like CSAM where generation is legally constrained. This motivates the Evaluation without Generation problem: assessing model capabilities without producing outputs. We argue that in such settings, capability must be inferred from the model's state, either its parameters or internal representations, rather than its outputs. We introduce Gaussian probing, a method that characterizes how LoRA adaptors perturb a model's internal representations by measuring responses to Gaussian latent ensembles. Unlike raw-weight baselines, Gaussian probing reliably distinguishes benign from harmful…
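The abstract does not spell out the probing procedure, so the following is only a toy sketch of the general idea, not the authors' implementation. It assumes a single linear layer with weight `W`, a LoRA update written as `B @ A`, and a probe statistic (the mean L2 norm of the output change on standard Gaussian inputs) chosen purely for illustration. The function name `gaussian_probe_response` and all shapes are hypothetical. The sketch also illustrates one plausible reading of "weight rescaling": scaling `B` by `c` and `A` by `1/c` leaves the product `B @ A` unchanged, so a response-based measurement is unaffected while statistics of the raw factors shift.

```python
import numpy as np

def gaussian_probe_response(W, A, B, n_samples=1024, seed=0):
    """Mean L2 norm of the change in a linear layer's output caused by a
    LoRA update (W -> W + B @ A), measured on standard Gaussian inputs.
    Illustrative statistic only; the paper's actual measure may differ."""
    rng = np.random.default_rng(seed)
    d_in = W.shape[1]
    Z = rng.standard_normal((n_samples, d_in))   # Gaussian latent ensemble
    base = Z @ W.T                               # base layer response
    adapted = Z @ (W + B @ A).T                  # response with the adaptor applied
    return float(np.mean(np.linalg.norm(adapted - base, axis=1)))

# Hypothetical toy dimensions: output 16, input 32, LoRA rank 4.
rng = np.random.default_rng(1)
d_out, d_in, r = 16, 32, 4
W = rng.standard_normal((d_out, d_in))
A = rng.standard_normal((r, d_in))
B = rng.standard_normal((d_out, r))

score = gaussian_probe_response(W, A, B)

# Rescale the factors: (c * B) @ (A / c) equals B @ A up to rounding.
c = 100.0
score_rescaled = gaussian_probe_response(W, A / c, B * c)

# A raw-weight statistic on the factors, by contrast, changes under rescaling.
raw_norm = np.linalg.norm(A) + np.linalg.norm(B)
raw_norm_rescaled = np.linalg.norm(A / c) + np.linalg.norm(B * c)

print(score, score_rescaled)
print(raw_norm, raw_norm_rescaled)
```

In this toy setting the two probe scores agree to numerical precision while the factor norms differ by orders of magnitude, which is the kind of gap between response-based and raw-weight features that the abstract alludes to.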
