TL;DR
This paper shows that Mamba-based state space models are vulnerable to Hidden State Poisoning Attacks (HiSPAs), in which short input phrases overwrite information stored in the hidden state, and uses interpretability analysis to point toward mitigations.
Contribution
It introduces the Hidden State Poisoning Attack (HiSPA), measures its impact on Mamba models with the RoBench-25 benchmark, extends the findings to Mamba-2 and hybrid models, and offers an interpretability analysis aimed at defense.
Findings
Mamba models are vulnerable to HiSPAs, which cause them to lose information held in their hidden states.
RoBench-25 effectively evaluates model robustness against HiSPA.
Interpretability analysis suggests potential mitigation strategies.
Abstract
State space models (SSMs) like Mamba offer efficient alternatives to Transformer-based language models, with linear time complexity. Yet, their adversarial robustness remains critically unexplored. This paper studies the phenomenon whereby specific short input phrases induce a partial amnesia effect in such models, by irreversibly overwriting information in their hidden states, referred to as a Hidden State Poisoning Attack (HiSPA). Our benchmark RoBench-25 allows evaluating a model's information retrieval capabilities when subject to HiSPAs, and confirms the vulnerability of SSMs against such attacks. Even the recent Jamba-1.7-Mini SSM-Transformer (a 52B hybrid model) collapses on RoBench-25 under some HiSPA triggers, whereas pure Transformers do not. We also observe that HiSPA triggers significantly weaken the Jamba model on the popular Open-Prompt-Injections benchmark, unlike pure…
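To make the idea of overwriting a hidden state concrete, here is a toy sketch. It is not the paper's attack or its models. It assumes a single scalar state updated as h_t = exp(delta_t * a) * h_{t-1} + delta_t * x_t, which is the general shape of a selective SSM recurrence. In Mamba the step size delta_t is computed from the input token; here it is set by hand, and the large "trigger" step is a hypothetical value chosen for illustration. Whether real HiSPA triggers act through this exact mechanism is an assumption of the sketch.

```python
import math


def final_state(inputs, deltas, a=-1.0):
    """Run a scalar toy recurrence: h_t = exp(delta_t * a) * h_{t-1} + delta_t * x_t."""
    h = 0.0
    for x, delta in zip(inputs, deltas):
        decay = math.exp(delta * a)  # lies in (0, 1] when a < 0 and delta >= 0
        h = decay * h + delta * x
    return h


def first_token_influence(deltas, a=-1.0):
    """Fraction of the first token's write that survives to the final state."""
    n = len(deltas)
    # Only the first token carries signal; the rest are zeros, so whatever
    # remains in h at the end came from that first write.
    survived = final_state([1.0] + [0.0] * (n - 1), deltas, a)
    return survived / deltas[0]  # normalise by the size of the initial write


# Hand-picked step sizes (hypothetical, for illustration only).
benign = [1.0, 0.01, 0.01, 0.01, 0.01]
poisoned = [1.0, 0.01, 20.0, 0.01, 0.01]  # one token with a very large step

print(first_token_influence(benign))    # most of the first write is retained
print(first_token_influence(poisoned))  # the first write is almost entirely erased
```

Because the state has a fixed size and no access to earlier tokens, nothing after the large step can restore what was erased. A Transformer can still attend to the original tokens, which is consistent with the contrast the abstract draws between SSMs and pure Transformers.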
