TL;DR
CLASP is a lightweight token-level classifier that detects hidden state poisoning attacks (HiSPAs) in state space models (SSMs) such as Mamba, aiming to harden SSMs and their hybrid variants against adversarial manipulation.
Contribution
Introduces CLASP (Classifier Against State Poisoning), a token-level XGBoost classifier that exploits distinct patterns in Mamba's block output embeddings (BOEs) to defend SSMs against HiSPAs, with strong generalization and real-time processing capabilities.
Findings
CLASP achieves a 95.9% token-level F1 score in detecting malicious tokens.
The model maintains high detection performance on unseen attack patterns.
CLASP requires few computational resources, making it suitable for real-world deployment.
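To make the headline metric concrete, here is a minimal sketch of how a token-level F1 score can be computed for the positive (malicious) class. The function name `token_f1` and the eight-token label sequences are hypothetical illustrations, not data or code from the paper.

```python
def token_f1(y_true, y_pred):
    """Return (precision, recall, F1) for the positive (malicious) class.

    y_true and y_pred are equal-length sequences of 0/1 labels, one per token.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("label sequences must have the same length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # malicious, flagged
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # benign, flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # malicious, missed

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical 8-token document: tokens 2-4 are malicious,
# the detector catches two of them and raises one false alarm.
y_true = [0, 0, 1, 1, 1, 0, 0, 0]
y_pred = [0, 0, 1, 1, 0, 1, 0, 0]
precision, recall, f1 = token_f1(y_true, y_pred)
```

Because the score is computed per token rather than per document, a detector is penalized both for missing part of an injected string and for flagging benign tokens next to it.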
Abstract
State space models (SSMs) like Mamba have gained significant traction as efficient alternatives to Transformers, achieving linear complexity while maintaining competitive performance. However, Hidden State Poisoning Attacks (HiSPAs), a recently discovered vulnerability that corrupts SSM memory through adversarial strings, pose a critical threat to these architectures and their hybrid variants. Framing the HiSPA mitigation task as a binary classification problem at the token level, we introduce the CLASP model (Classifier Against State Poisoning) to defend against this threat. CLASP exploits distinct patterns in Mamba's block output embeddings (BOEs) and uses an XGBoost classifier to identify malicious tokens with minimal computational overhead. We consider a realistic scenario in which both SSMs and HiSPAs are likely to be used: an LLM screening résumés to identify the best…
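The token-level binary framing mentioned in the abstract can be illustrated with a small labeling sketch: every token receives label 1 if it overlaps an injected string and 0 otherwise. This only shows what the targets of such a classifier look like; it is not the authors' pipeline. The whitespace tokenizer stands in for a real model tokenizer, the helper names are hypothetical, and the injected string is a made-up placeholder rather than an actual HiSPA. In the paper, the classifier's inputs are Mamba's BOEs, which this sketch does not compute.

```python
import re


def tokenize_with_offsets(text):
    """Whitespace tokenizer (an assumption) returning (token, start, end)."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def label_tokens(text, injected):
    """Pair each token with 1 if it overlaps an occurrence of `injected`, else 0."""
    spans = []
    if injected:
        start = text.find(injected)
        while start != -1:
            end = start + len(injected)
            spans.append((start, end))
            start = text.find(injected, end)

    labeled = []
    for token, tok_start, tok_end in tokenize_with_offsets(text):
        # Half-open intervals [a, b) and [c, d) overlap when a < d and c < b.
        overlaps = any(tok_start < s_end and s_start < tok_end
                       for s_start, s_end in spans)
        labeled.append((token, int(overlaps)))
    return labeled


# Hypothetical resume snippet with a placeholder injected string.
injected = "IGNORE PRIOR TEXT."
text = "Skilled in Python. " + injected + " Led a team of four."
labeled = label_tokens(text, injected)
labels = [label for _, label in labeled]
```

Labels of this shape are what a token-level metric such as F1 is computed against.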
