H-Node Attack and Defense in Large Language Models
Eric Yocam, Varghese Vaidyan, Yong Wang

TL;DR
This paper introduces H-Node ANC, a framework for identifying, amplifying, and defending against hallucination signals in large language models at the hidden-state dimension level, improving robustness with minimal performance loss.
Contribution
It develops a mechanistic approach that localizes hallucination signals to individual hidden-state dimensions in LLMs, and proposes an adaptive defense that reduces grounded activation drift by 33-42% relative to static cancellation.
Findings
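A logistic regression probe on last-token hidden states localizes hallucination signal to a small set of high-variance dimensions (H-Nodes), with probe AUC reaching 0.90 across four architectures.
A white-box adversarial attack amplifies H-Nodes at inference time with 3.02x selectivity and less than 10% visibility to the defender.
Adaptive ANC reduces grounded activation drift by 33-42% over static cancellation and recovers up to 0.69 robustness.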
Abstract
We present H-Node Adversarial Noise Cancellation (H-Node ANC), a mechanistic framework that identifies, exploits, and defends hallucination representations in transformer-based large language models (LLMs) at the level of individual hidden-state dimensions. A logistic regression probe trained on last-token hidden states localizes hallucination signal to a small set of high-variance dimensions -- termed Hallucination Nodes (H-Nodes) -- with probe AUC reaching 0.90 across four architectures. A white-box adversarial attack amplifies these dimensions at inference time via a real-time forward hook, achieving a selectivity of 3.02x with less than 10% visibility to the defender. Adaptive ANC defense suppresses H-Node excess in-pass using confidence-weighted cancellation, reducing grounded activation drift by 33-42% over static cancellation. A dynamic iterative extension that re-ranks…
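To make the pipeline in the abstract concrete, the following is a minimal sketch on synthetic data, not the authors' implementation. It assumes that H-Nodes can be picked as the dimensions with the largest probe weight magnitude (the paper describes them as high-variance dimensions; its exact selection rule is not given here), that the attack is a plain multiplicative scaling, and that confidence-weighted cancellation pulls each H-Node back toward a grounded baseline in proportion to the probe's confidence. All function names, the synthetic data, and the parameters are hypothetical; a real setup would apply these steps inside a forward hook on a transformer layer.

```python
import math
import random

def make_data(n=200, dim=8, h_dims=(2, 5), seed=0):
    """Synthetic stand-in for last-token hidden states (assumption).
    Samples labeled 1 ("hallucinated") get a shift on h_dims."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n):
        label = i % 2
        vec = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        if label == 1:
            for d in h_dims:
                vec[d] += 3.0
        X.append(vec)
        y.append(label)
    return X, y

def sigmoid(z):
    z = max(-30.0, min(30.0, z))  # clip to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))

def probe_confidence(w, b, h):
    """Probe's estimated probability that state h is hallucinated."""
    return sigmoid(sum(wi * hi for wi, hi in zip(w, h)) + b)

def train_probe(X, y, lr=0.1, epochs=200):
    """Logistic regression probe trained by batch gradient descent."""
    dim, n = len(X[0]), len(X)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * dim, 0.0
        for x, t in zip(X, y):
            err = probe_confidence(w, b, x) - t
            for j in range(dim):
                gw[j] += err * x[j]
            gb += err
        w = [wi - lr * g / n for wi, g in zip(w, gw)]
        b -= lr * gb / n
    return w, b

def top_k_dims(w, k):
    """Candidate H-Nodes: the k dimensions with largest |weight| (assumption)."""
    return sorted(range(len(w)), key=lambda j: abs(w[j]), reverse=True)[:k]

def grounded_baseline(X, y):
    """Per-dimension mean over grounded (label 0) samples."""
    rows = [x for x, t in zip(X, y) if t == 0]
    return [sum(col) / len(rows) for col in zip(*rows)]

def amplify(h, dims, alpha):
    """Attack sketch: scale only the selected dimensions by alpha."""
    out = list(h)
    for d in dims:
        out[d] *= alpha
    return out

def adaptive_cancel(h, dims, baseline, confidence):
    """Defense sketch: remove a confidence-weighted share of each
    selected dimension's excess over the grounded baseline."""
    out = list(h)
    for d in dims:
        out[d] -= confidence * (out[d] - baseline[d])
    return out

if __name__ == "__main__":
    X, y = make_data()
    w, b = train_probe(X, y)
    dims = top_k_dims(w, 2)
    base = grounded_baseline(X, y)
    attacked = amplify(X[1], dims, 2.0)
    conf = probe_confidence(w, b, attacked)
    defended = adaptive_cancel(attacked, dims, base, conf)
    print("candidate H-Nodes:", sorted(dims))
    print("probe confidence on attacked state:", round(conf, 3))
    print("defended values on H-Nodes:", [round(defended[d], 3) for d in dims])
```

With confidence 0 the defense leaves the state untouched, and with confidence 1 it resets the selected dimensions to the grounded baseline; static cancellation would correspond to using a fixed weight regardless of the probe output.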
