From Hallucinations to Jailbreaks: Rethinking the Vulnerability of Large Foundation Models
Haibo Jin, Peiyan Zhang, Peiran Wang, Man Luo, Haohan Wang

TL;DR
This paper presents a unified theoretical framework linking hallucinations and jailbreak vulnerabilities in large foundation models, showing that defenses against one can influence the other and highlighting the need for joint robustness strategies.
Contribution
It introduces a novel unified framework modeling both vulnerabilities, with empirical validation, revealing their shared underlying mechanisms and implications for defense strategies.
Findings
Hallucinations and jailbreaks share similar optimization dynamics.
Mitigating one vulnerability can reduce the success of the other.
Shared attention dynamics drive both vulnerabilities.
Abstract
Large foundation models (LFMs) are susceptible to two distinct vulnerabilities: hallucinations and jailbreak attacks. While typically studied in isolation, we observe that defenses targeting one often affect the other, hinting at a deeper connection. We propose a unified theoretical framework that models jailbreaks as token-level optimization and hallucinations as attention-level optimization. Within this framework, we establish two key propositions: (1) Similar Loss Convergence - the loss functions for both vulnerabilities converge similarly when optimizing for target-specific outputs; and (2) Gradient Consistency in Attention Redistribution - both exhibit consistent gradient behavior driven by shared attention dynamics. We validate these propositions empirically on LLaVA-1.5 and MiniGPT-4, showing consistent optimization trends and aligned gradients. Leveraging…
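To make the notion of "aligned gradients" concrete, the following is a minimal sketch of how gradient alignment between an attention-level and a token-level objective could be measured with cosine similarity. Everything in it is an assumption for illustration: the two toy losses, the shared attention logits `z`, the target position `t`, and the per-position scores `v` are hypothetical and are not the paper's loss definitions or experimental setup. The alignment here also holds partly by construction, since `v` is chosen to peak at position `t`.

```python
import math

def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    e = [math.exp(x - m) for x in z]
    s = sum(e)
    return [x / s for x in e]

def attention_loss(z, t):
    """Toy attention-level objective (assumption): concentrate attention on position t."""
    return -math.log(softmax(z)[t])

def token_loss(z, v):
    """Toy token-level objective (assumption): negative log-probability of a target
    token whose logit is the attention-weighted sum of per-position scores v."""
    a = softmax(z)
    logit = sum(ai * vi for ai, vi in zip(a, v))
    return math.log(1.0 + math.exp(-logit))  # equals -log(sigmoid(logit))

def numerical_grad(f, z, eps=1e-6):
    """Central finite-difference gradient of f with respect to the logits z."""
    grad = []
    for i in range(len(z)):
        zp = list(z)
        zm = list(z)
        zp[i] += eps
        zm[i] -= eps
        grad.append((f(zp) - f(zm)) / (2 * eps))
    return grad

def cosine(u, w):
    """Cosine similarity between two gradient vectors."""
    dot = sum(a * b for a, b in zip(u, w))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_w = math.sqrt(sum(b * b for b in w))
    return dot / (norm_u * norm_w)

# Hypothetical shared attention logits over four input positions.
z = [0.2, -0.1, 0.4, 0.0]
t = 2                        # hypothetical target position
v = [-1.0, -0.5, 2.0, 0.0]   # hypothetical scores; the target token is favored at position t

g_att = numerical_grad(lambda x: attention_loss(x, t), z)
g_tok = numerical_grad(lambda x: token_loss(x, v), z)
alignment = cosine(g_att, g_tok)
print(f"gradient cosine similarity: {alignment:.3f}")
```

A cosine similarity near 1 means a descent step on one objective also lowers the other; a value near 0 or below would indicate unrelated or conflicting updates. In a real model the gradients would come from automatic differentiation over actual attention maps rather than finite differences on a four-element vector.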
Peer Reviews
Decision: Submitted to ICLR 2026
The work stands out for its novel theoretical formulation that connects hallucinations (internal factual drift) and jailbreaks (external adversarial manipulation). The theoretical derivations are rigorous, internally consistent, and well justified. The proofs in the appendices are mathematically sound, showing clear logical progression from assumptions to conclusions. The findings are significant for the broader AI robustness and safety community. By establishing that hallucinations and jailbreak
The framework is purely correlation-based—it shows aligned optimization but does not fully establish causal mechanisms linking hallucination suppression to jailbreak resistance. More ablation or visualization would make the connection more interpretable. Figures lack error bars or variance analysis. Since the optimization experiments involve gradient-based runs with potential stochasticity, reporting results over multiple seeds would strengthen reproducibility. The theoretical assumptions (e.g
* The core hypothesis that hallucinations and jailbreaks are two manifestations of a single, shared failure mode in LFM optimization is highly original and significant. If proven rigorously, this insight would fundamentally change how robustness research is approached.
* The results in Section 5.4, showing that mitigation techniques like OPERA and VCD (developed for hallucination) significantly reduce Attack Success Rate (ASR) against jailbreaks (and vice versa), are compelling and provide the s
* The hallucination loss $\mathcal{L}^{hallu}$ (Eq. 9) is defined as guiding attention toward a fixed target position t (an input token) while suppressing attention elsewhere. This is an extremely artificial and simplifying proxy for real-world hallucination, which involves the model generating new, inaccurate output tokens not directly supported by the input context. By forcing optimization to concentrate attention on a pre-defined input token, the authors are testing an engineered attention pr
- Understanding and mitigating both of the vulnerabilities considered in the paper constitute very relevant and active research questions
- The paper is clearly written, easy to read, and reasonably self-contained.
**Proposition 2.1 is imprecise.**
- To relate $\mathcal{L}^{adv}(y^\star)$ and $\mathcal{L}^{hallu}(A_{ij}^\Delta)$, one needs to establish a relationship between their arguments. This is missing from the statement of Proposition 4.1. What arguments is it supposed to hold for?
- The theoretical result, titled “Similar Loss Convergence”, implies that the two objectives “converge” to the same value. However, the asymptotic regime considered in the proposition is rather unusual, and the conclusion
Taxonomy
Topics: Crime, Illicit Activities, and Governance · Big Data Technologies and Applications
Methods: Softmax · Attention Is All You Need
