Spectral Guardrails for Agents in the Wild: Detecting Tool Use Hallucinations via Attention Topology
Valentin Noël

TL;DR
This paper introduces a training-free spectral analysis method for detecting tool-use hallucinations in autonomous agents. It reaches high recall without labeled training data and suggests that hallucinations coincide with attention that degrades into noise.
Contribution
It presents a training-free spectral guardrail, based on attention topology, that detects tool-use hallucinations across models and offers new insight into model failure states.
Findings
Spectral features achieve up to 98.2% recall in hallucination detection.
Single-layer spectral features act as near-perfect hallucination detectors.
Spectral analysis suggests that hallucination is a thermodynamic state change rather than merely a wrong token: the model's attention becomes noise when it errs.
Abstract
Deploying autonomous agents in the wild requires reliable safeguards against tool-use failures. We propose a training-free guardrail based on spectral analysis of attention topology that complements supervised approaches. On Llama 3.1 8B, our method achieves 97.7% recall with multi-feature detection and 86.1% recall with 81.0% precision for balanced deployment, without requiring any labeled training data. Most remarkably, we discover that single-layer spectral features act as near-perfect hallucination detectors: Llama L26 Smoothness achieves 98.2% recall (213/217 hallucinations caught) with a single threshold, and Mistral L3 Entropy achieves 94.7% recall. This suggests hallucination is not merely a wrong token but a thermodynamic state change: the model's attention becomes noise when it errs. Through controlled cross-model evaluation on matched domains (, , same…
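The abstract does not define how the "Smoothness" feature is computed, so the following is only a minimal sketch under stated assumptions, not the paper's implementation. It assumes that one attention matrix is treated as a weighted graph over tokens, that a scalar per-token signal is available, and that a less smooth (noisier) signal indicates a hallucination. The function names, the symmetrization step, the choice of signal, and the threshold direction are all hypothetical. Only the 213/217 count comes from the abstract.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def symmetrize(attn: Matrix) -> Matrix:
    """Turn a directed attention matrix into undirected edge weights (assumption)."""
    n = len(attn)
    return [[0.5 * (attn[i][j] + attn[j][i]) for j in range(n)] for i in range(n)]


def laplacian_roughness(attn: Matrix, signal: Sequence[float]) -> float:
    """Rayleigh quotient x^T L x / x^T x, with L = D - W and W the symmetrized attention.

    Uses the identity x^T L x = 1/2 * sum_ij W_ij * (x_i - x_j)^2.
    Lower values mean the signal changes little across strongly attended token pairs.
    """
    weights = symmetrize(attn)
    n = len(weights)
    energy = 0.5 * sum(
        weights[i][j] * (signal[i] - signal[j]) ** 2
        for i in range(n)
        for j in range(n)
    )
    norm = sum(x * x for x in signal)
    return energy / norm if norm > 0 else 0.0


def flag_hallucination(attn: Matrix, signal: Sequence[float], threshold: float) -> bool:
    """Single-threshold detector; the direction of the comparison is an assumption."""
    return laplacian_roughness(attn, signal) > threshold


def recall(true_positives: int, false_negatives: int) -> float:
    """Fraction of actual hallucinations that the detector caught."""
    return true_positives / (true_positives + false_negatives)


# Toy usage: uniform attention over three tokens (hypothetical data).
uniform = [[1 / 3] * 3 for _ in range(3)]
smooth_case = flag_hallucination(uniform, [1.0, 1.0, 1.0], threshold=0.5)
rough_case = flag_hallucination(uniform, [1.0, 0.0, 0.0], threshold=0.5)

# Recall reported in the abstract: 213 of 217 hallucinations caught.
reported_recall = round(100 * recall(213, 4), 1)  # 98.2
```

A training-free detector of this shape needs no labeled data to compute the feature itself; only the threshold has to be chosen, and how the paper selects it is not stated in the text above.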
Taxonomy
Topics: Adversarial Robustness in Machine Learning · EEG and Brain-Computer Interfaces · Big Data and Digital Economy
