Squish and Release: Exposing Hidden Hallucinations by Making Them Surface as Safety Signals

Nathaniel Oh; Paul Attie

arXiv:2603.26829·cs.LG·March 31, 2026

TL;DR

The paper introduces Squish and Release (S&R), an activation-patching framework for detecting and managing order-gap hallucinations in language models, evaluated on OLMo-2 7B with the manually graded Order-Gap Benchmark.

Contribution

It presents a model-agnostic activation-patching architecture built from a fixed detector body and a swappable detector core, along with the Order-Gap Benchmark, a new benchmark for order-gap hallucination detection.

Findings

01. Cascade collapse is near-total, with 99.8% compliance.

02. The safety detector is localized to layers 24-31, with high effectiveness.

03. Engineered cores can release 76.6% of collapsed chains.

Abstract

Language models detect false premises when asked directly but absorb them under conversational pressure, producing authoritative professional output built on errors they already identified. This failure, order-gap hallucination, is invisible to output inspection because the error migrates into the activation space of the safety circuit, suppressed but not erased. We introduce Squish and Release (S&R), an activation-patching architecture with two components: a fixed detector body (layers 24-31, the localized safety evaluation circuit) and a swappable detector core (an activation vector controlling perception direction). A safety core shifts the model from compliance toward detection; an absorb core reverses it. We evaluate on OLMo-2 7B using the Order-Gap Benchmark: 500 chains across 500 domains, all manually graded. Key findings: cascade collapse is near-total (99.8% compliance at…
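
The body-and-core split described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the elementwise `toy_layer`, the additive patching rule, the `scale` parameter, the `detection_score` readout, and all vectors are assumptions made for illustration. Only the layer range 24-31 and the idea of a swappable core vector come from the abstract.

```python
import math

NUM_LAYERS = 32               # toy depth, chosen so that layers 24-31 exist (assumption)
BODY_LAYERS = range(24, 32)   # layers 24-31: the fixed detector body named in the abstract


def toy_layer(h):
    """Stand-in for a transformer block: a simple elementwise residual update."""
    return [x + 0.1 * math.tanh(x) for x in h]


def forward(h, core=None, scale=1.0):
    """Run the toy stack, patching `core` into the hidden state inside the body layers."""
    for idx in range(NUM_LAYERS):
        h = toy_layer(h)
        if core is not None and idx in BODY_LAYERS:
            # Assumed patching rule: add the core vector to the hidden state.
            # The paper may patch activations differently (e.g. by replacement).
            h = [x + scale * c for x, c in zip(h, core)]
    return h


def detection_score(h, direction):
    """Hypothetical readout: projection of the final state onto a detection direction."""
    return sum(x * d for x, d in zip(h, direction))


# Hypothetical vectors; in practice a core would be derived from real model activations.
direction = [0.5, -0.5, 0.5, -0.5]
safety_core = direction                   # pushes toward detection
absorb_core = [-d for d in direction]     # pushes toward compliance
prompt_state = [0.2, 0.1, -0.3, 0.4]

base = detection_score(forward(prompt_state), direction)
released = detection_score(forward(prompt_state, safety_core), direction)
squished = detection_score(forward(prompt_state, absorb_core), direction)

print(squished < base < released)
```

In this sketch the body (which layers are patched) stays fixed while the core is swapped, so the same `forward` function moves the readout in either direction depending on the vector supplied. With a real model, the same pattern would typically be implemented with forward hooks on the chosen layers.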
