Invisible Hands: Gray-Box Bit Flip Attack for Steering LLMs Without Knowledge of Gradients, Data, and Weights
Abeer Matar A. Almalky, Ziyan Wang, Mohaiminul Al Nahian, Li Yang, Adnan Siraj Rakin

TL;DR
This paper introduces Invisible Hands, a gray-box bit flip attack (BFA) that compromises large language models without access to gradients, data, or weights.
Contribution
The paper presents the first gray-box bit flip attack (BFA) framework for LLMs. It estimates vulnerability without data or gradient access, which lowers memory overhead relative to white-box methods and makes the attack more practical.
Findings
Effective attack achieved with minimal weight perturbations
Scales efficiently across six open-source LLMs
Reduces memory overhead compared to white-box methods
Abstract
In recent years, large language models (LLMs) have achieved remarkable advances and are increasingly deployed in critical applications across diverse domains. This growing adoption raises urgent concerns about their security and robustness. In this work, we investigate the impact of Bit Flip Attacks (BFAs) on LLMs, which exploit hardware faults to corrupt model parameters, thereby threatening model integrity and performance. Existing BFA studies primarily assume a white-box setting with access to exact model weights and part of the dataset, and rely on progressive gradient-based bit-search strategies to identify vulnerable bits in model weights. However, gradient computation for LLMs is computationally expensive and memory intensive. In addition, assuming access to exact victim model weights and datasets is challenging due to increasingly strict user privacy regulations. To address…
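To make concrete why a single hardware-induced bit flip can corrupt a model parameter, here is a minimal Python sketch of flipping one bit in a stored weight. It is not the paper's bit-search method. The helper names `flip_bit_int8` and `flip_bit_float32` are hypothetical, and the int8 and float32 storage formats are assumptions chosen only for illustration.

```python
import struct


def flip_bit_int8(value: int, bit: int) -> int:
    """Flip one bit of a signed 8-bit weight stored in two's complement.

    Hypothetical helper for illustration; assumes int8 quantized storage.
    """
    if not -128 <= value <= 127:
        raise ValueError("value must fit in a signed 8-bit integer")
    if not 0 <= bit <= 7:
        raise ValueError("bit must be in the range 0..7")
    raw = value & 0xFF          # reinterpret as an unsigned byte
    raw ^= 1 << bit             # flip the selected bit
    return raw - 256 if raw >= 128 else raw  # back to signed


def flip_bit_float32(value: float, bit: int) -> float:
    """Flip one bit of an IEEE 754 single-precision weight.

    Hypothetical helper for illustration; assumes float32 storage.
    """
    if not 0 <= bit <= 31:
        raise ValueError("bit must be in the range 0..31")
    (raw,) = struct.unpack("<I", struct.pack("<f", value))
    raw ^= 1 << bit
    (flipped,) = struct.unpack("<f", struct.pack("<I", raw))
    return flipped


# Flipping the sign bit (bit 7) of a small int8 weight changes it drastically.
print(flip_bit_int8(3, 7))        # -125
# Flipping the lowest bit barely changes it.
print(flip_bit_int8(3, 0))        # 2
# Flipping the top exponent bit (bit 30) of 0.5 yields 2**127.
print(flip_bit_float32(0.5, 30) == 2.0 ** 127)  # True
```

The gap between the high-bit and low-bit cases is why BFAs search for a small set of vulnerable bits rather than flipping bits at random.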
