H-Node Attack and Defense in Large Language Models

Eric Yocam; Varghese Vaidyan; Yong Wang

arXiv:2603.26045·cs.LG·March 30, 2026

TL;DR

This paper introduces H-Node ANC, a framework for identifying, amplifying, and defending against hallucination signals in large language models at the hidden-state dimension level, improving robustness with minimal performance loss.

Contribution

The paper develops a mechanistic approach to localizing hallucination signals in LLMs at the level of individual hidden-state dimensions, and proposes an adaptive defense that reduces grounded activation drift by 33-42% relative to static cancellation.

Findings

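01. A logistic regression probe on last-token hidden states localizes hallucination signal to a small set of high-variance dimensions, called H-Nodes, with probe AUC reaching 0.90 across four architectures.

02. A white-box adversarial attack amplifies H-Nodes at inference time with 3.02x selectivity and less than 10% visibility to the defender.

03. Adaptive ANC reduces grounded activation drift by 33-42% relative to static cancellation and recovers up to 0.69 robustness.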
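
To make the first finding concrete, here is a minimal sketch of ranking hidden-state dimensions by variance and scoring how well the selected dimensions separate hallucinated from grounded examples. It is an illustration under stated assumptions, not the paper's procedure: the toy data, the helper names `top_variance_dims` and `auc`, and the use of a plain sum as the score (in place of a trained logistic regression probe) are all hypothetical.

```python
from statistics import pvariance

def top_variance_dims(states, k):
    """Return the indices of the k dimensions with the largest variance."""
    dims = range(len(states[0]))
    variances = [pvariance([row[d] for row in states]) for d in dims]
    return sorted(dims, key=lambda d: variances[d], reverse=True)[:k]

def auc(scores, labels):
    """Chance that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical last-token hidden states (4 examples, 3 dimensions).
# Label 1 = hallucinated, 0 = grounded. Values are invented for illustration.
states = [
    [0.1, 2.0, 0.0],
    [0.0, -1.5, 0.1],
    [0.1, 1.8, 0.0],
    [0.0, -1.7, 0.1],
]
labels = [1, 0, 1, 0]

h_nodes = top_variance_dims(states, k=1)
# Stand-in score: sum of the selected dimensions (assumption, not a trained probe).
scores = [sum(row[d] for d in h_nodes) for row in states]
print("selected dimensions:", h_nodes)
print("AUC of stand-in score:", auc(scores, labels))
```

On real data the score would come from a probe fitted to labeled hidden states, and the AUC would be measured on held-out examples.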

Abstract

We present H-Node Adversarial Noise Cancellation (H-Node ANC), a mechanistic framework that identifies, exploits, and defends hallucination representations in transformer-based large language models (LLMs) at the level of individual hidden-state dimensions. A logistic regression probe trained on last-token hidden states localizes hallucination signal to a small set of high-variance dimensions -- termed Hallucination Nodes (H-Nodes) -- with probe AUC reaching 0.90 across four architectures. A white-box adversarial attack amplifies these dimensions at inference time via a real-time forward hook, achieving a selectivity of 3.02x with less than 10% visibility to the defender. Adaptive ANC defense suppresses H-Node excess in-pass using confidence-weighted cancellation, reducing grounded activation drift by 33-42% over static cancellation. A dynamic iterative extension that re-ranks…
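
The attack and defense described in the abstract can be illustrated with a toy sketch in plain Python. Everything here is an assumption made for illustration: the 4-dimensional hidden state, the baseline vector, the probe weights, the gain of 3.0, and the helper names (`amplify`, `cancel`, `probe_confidence`, `drift`) are hypothetical and do not come from the paper, which applies its attack through a real-time forward hook inside the model. The sketch only shows why weighting the cancellation by probe confidence might disturb grounded inputs less than a static rule that always removes the full excess.

```python
import math

def probe_confidence(hidden, weights, bias):
    """Sigmoid of a linear probe: estimated probability of hallucination."""
    logit = sum(w * h for w, h in zip(weights, hidden)) + bias
    return 1.0 / (1.0 + math.exp(-logit))

def amplify(hidden, h_nodes, gain):
    """Toy attack: scale only the selected dimensions by `gain`."""
    out = list(hidden)
    for i in h_nodes:
        out[i] *= gain
    return out

def cancel(hidden, h_nodes, baseline, confidence=1.0):
    """Remove a fraction `confidence` of each selected dimension's excess
    over the baseline. confidence=1.0 plays the role of static cancellation."""
    out = list(hidden)
    for i in h_nodes:
        out[i] -= confidence * (out[i] - baseline[i])
    return out

def drift(before, after):
    """Euclidean distance between two activation vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(before, after)))

# Hypothetical setup: dimensions 1 and 3 stand in for the H-Nodes.
h_nodes = [1, 3]
baseline = [0.0, 1.0, 0.0, 1.0]
weights, bias = [0.0, 1.0, 0.0, 1.0], -4.0

grounded = [0.5, 1.2, -0.3, 0.8]
attacked = amplify(grounded, h_nodes, gain=3.0)

for name, h in [("grounded", grounded), ("attacked", attacked)]:
    conf = probe_confidence(h, weights, bias)
    static_drift = drift(h, cancel(h, h_nodes, baseline))
    adaptive_drift = drift(h, cancel(h, h_nodes, baseline, confidence=conf))
    print(f"{name}: confidence={conf:.3f} "
          f"static_drift={static_drift:.3f} adaptive_drift={adaptive_drift:.3f}")
```

In this toy setting the probe assigns low confidence to the grounded vector, so the confidence-weighted rule moves it only slightly, while the amplified vector receives high confidence and is pulled most of the way back toward the baseline.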
