Geometry-Lite: Interpretable Safety Probing via Layer-Wise Margin Geometry

Woo Seob Sim; Yu Rang Park

arXiv:2605.20241·cs.LG·May 21, 2026

TL;DR

Geometry-Lite is a compact, interpretable probe that analyzes layer-wise margin geometry in large language models to understand safety evidence formation and stability across benchmarks.

Contribution

The paper introduces Geometry-Lite, a probe that decomposes prompt-level safety signals into layer-wise margin profiles, which makes the signal easier to interpret than the output of existing single-layer probes.

Findings

01

Safety evidence is mainly expressed through persistent boundary-position geometry.

02

Finite-difference drift and structural summaries contribute little to AUROC.

03

Linear boundaries are sharp on the training data, but class-conditional geometry is more stable under benchmark shift.

Abstract

Prompt-level safety probes for large language models use hidden-state representations to separate safe from unsafe prompts, but strong average detection performance does not explain the geometry of this separation. In particular, it remains unclear how safety evidence is formed across layers, which aspects of that layer-wise geometry support low-false-positive decisions, and which geometric biases remain stable under benchmark shift. We study this as an empirical decomposition problem and introduce Geometry-Lite, a compact prompt-level probe that maps each layer's final prompt-token representation to signed margins under centroid, local-neighborhood, and supervised linear-boundary readouts, then summarizes the resulting margin profiles by boundary position, layer-to-layer change, and coarse shape. Across nine instruction-tuned backbones (1.2B to 70B) and seven safety benchmarks,…
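
The pipeline described in the abstract (per-layer signed margins under three readouts, then per-prompt summaries of the margin profile) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. Everything specific here is an assumption made for illustration: the function names, the sign convention (positive means closer to the unsafe class), Euclidean distance, `k=5` neighbors, a ridge least-squares classifier standing in for the supervised linear boundary, and the particular summaries chosen for boundary position (mean margin), layer-to-layer change (mean absolute finite difference), and coarse shape (relative depth of the largest absolute margin). The hidden states are synthetic.

```python
import numpy as np

# Hypothetical convention: label 0 = safe, 1 = unsafe; positive margin = "unsafe side".

def centroid_margin(train_x, train_y, x):
    """Distance to the safe centroid minus distance to the unsafe centroid."""
    mu_safe = train_x[train_y == 0].mean(axis=0)
    mu_unsafe = train_x[train_y == 1].mean(axis=0)
    return np.linalg.norm(x - mu_safe, axis=1) - np.linalg.norm(x - mu_unsafe, axis=1)

def knn_margin(train_x, train_y, x, k=5):
    """Mean distance to the k nearest safe points minus the same for unsafe points."""
    def mean_knn_dist(ref):
        d = np.linalg.norm(x[:, None, :] - ref[None, :, :], axis=2)
        kk = min(k, ref.shape[0])
        return np.sort(d, axis=1)[:, :kk].mean(axis=1)
    return mean_knn_dist(train_x[train_y == 0]) - mean_knn_dist(train_x[train_y == 1])

def linear_margin(train_x, train_y, x, ridge=1e-2):
    """Signed distance to a ridge least-squares boundary (assumed stand-in)."""
    mean = train_x.mean(axis=0)
    xc = train_x - mean
    t = 2.0 * train_y - 1.0                      # targets in {-1, +1}
    d = xc.shape[1]
    w = np.linalg.solve(xc.T @ xc + ridge * np.eye(d), xc.T @ (t - t.mean()))
    b = t.mean()
    return ((x - mean) @ w + b) / (np.linalg.norm(w) + 1e-12)

def margin_profiles(train_h, train_y, h):
    """Map hidden states (n, layers, dim) to margins (n, layers, 3 readouts)."""
    readouts = (centroid_margin, knn_margin, linear_margin)
    per_layer = [
        np.stack([f(train_h[:, l], train_y, h[:, l]) for f in readouts], axis=1)
        for l in range(h.shape[1])
    ]
    return np.stack(per_layer, axis=1)

def summarize(profiles):
    """Summarize each margin profile; returns (n, 9) features."""
    n_layers = profiles.shape[1]
    position = profiles.mean(axis=1)                          # boundary position
    drift = np.abs(np.diff(profiles, axis=1)).mean(axis=1)    # layer-to-layer change
    shape = np.abs(profiles).argmax(axis=1) / max(n_layers - 1, 1)  # coarse shape
    return np.concatenate([position, drift, shape], axis=1)

# Synthetic stand-in for final-prompt-token hidden states: 40 prompts, 4 layers, 8 dims.
rng = np.random.default_rng(0)
n, n_layers, dim = 40, 4, 8
y = np.repeat([0, 1], n // 2)
h = rng.normal(size=(n, n_layers, dim))
for l in range(n_layers):
    h[y == 1, l] += 2.0 * (l + 1) / n_layers   # unsafe class separates more with depth

profiles = margin_profiles(h, y, h)   # fit and evaluate on the same data, for brevity
feats = summarize(profiles)
```

The resulting `feats` array has one row per prompt and could be fed to any small downstream classifier. Keeping the three summary groups as separate column blocks is what would allow an ablation of the kind reported in the findings, for example dropping the drift columns and comparing AUROC.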
