Geometry-Lite: Interpretable Safety Probing via Layer-Wise Margin Geometry
Woo Seob Sim, Yu Rang Park

TL;DR
Geometry-Lite is a compact, interpretable probe that analyzes layer-wise margin geometry in large language models to understand safety evidence formation and stability across benchmarks.
Contribution
The paper introduces Geometry-Lite, a method that decomposes safety signals into layer-wise margin geometry, improving interpretability over existing single-layer probes.
Findings
Safety evidence is mainly expressed through persistent boundary-position geometry.
Finite-difference drift and structural summaries contribute little to AUROC.
Linear boundaries are sharp on training data, but class-conditional geometry is more stable under benchmark shift.
Abstract
Prompt-level safety probes for large language models use hidden-state representations to separate safe from unsafe prompts, but strong average detection performance does not explain the geometry of this separation. In particular, it remains unclear how safety evidence is formed across layers, which aspects of that layer-wise geometry support low-false-positive decisions, and which geometric biases remain stable under benchmark shift. We study this as an empirical decomposition problem and introduce Geometry-Lite, a compact prompt-level probe that maps each layer's final prompt-token representation to signed margins under centroid, local-neighborhood, and supervised linear-boundary readouts, then summarizes the resulting margin profiles by boundary position, layer-to-layer change, and coarse shape. Across nine instruction-tuned backbones (B--B) and seven safety benchmarks,…
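The pipeline described in the abstract (per-layer signed margins, then a summary of the margin profile) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `centroid_margins` and `summarize_profile` are hypothetical, the margin is assumed to be a difference of Euclidean distances to the two class centroids, the summary statistics are illustrative choices, and the hidden states are synthetic. Only the centroid readout is shown; the local-neighborhood and linear-boundary readouts are omitted.

```python
import numpy as np

def centroid_margins(H_train, y_train, h):
    """Signed centroid margin per layer (assumed definition, not the paper's).

    H_train: (n, L, d) final prompt-token hidden states for n prompts, L layers.
    y_train: (n,) labels, 1 = unsafe, 0 = safe.
    h:       (L, d) hidden states of the prompt being scored.
    Returns an (L,) array; positive means closer to the unsafe centroid.
    """
    mu_unsafe = H_train[y_train == 1].mean(axis=0)  # (L, d)
    mu_safe = H_train[y_train == 0].mean(axis=0)    # (L, d)
    d_safe = np.linalg.norm(h - mu_safe, axis=-1)
    d_unsafe = np.linalg.norm(h - mu_unsafe, axis=-1)
    return d_safe - d_unsafe

def summarize_profile(margins):
    """Summarize a margin profile by position, layer-to-layer change, and shape."""
    drift = np.diff(margins)  # finite-difference change between adjacent layers
    return {
        "mean_margin": float(margins.mean()),             # boundary position
        "frac_positive": float((margins > 0).mean()),     # persistence across layers
        "mean_abs_drift": float(np.abs(drift).mean()),    # layer-to-layer change
        "argmax_layer": int(np.argmax(margins)),          # coarse shape
    }

# Synthetic stand-in for real hidden states: safe prompts near -1, unsafe near +1.
rng = np.random.default_rng(0)
n, n_layers, dim = 40, 6, 8
y = np.repeat([0, 1], n // 2)
H = rng.normal(scale=0.1, size=(n, n_layers, dim)) + (2 * y - 1)[:, None, None]

# Score a prompt that sits exactly on the unsafe centroid at every layer.
probe = H[y == 1].mean(axis=0)
margins = centroid_margins(H, y, probe)
summary = summarize_profile(margins)
print(summary)
```

In this toy setup the scored prompt lies on the unsafe centroid, so its margin is positive at every layer. With real model activations, the summary features would typically be fed to a small downstream classifier.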
