How a Bit Becomes a Story: Semantic Steering via Differentiable Fault Injection
Zafaryab Haider, Md Hafizur Rahman, Shane Moeykens, Vijay Devabhaktuni, Prabuddha Chakraborty

TL;DR
This paper introduces BLADE, a gradient-based framework to identify and manipulate critical bits in LLM weights, enabling controlled semantic shifts in image captioning outputs while maintaining grammatical correctness.
Contribution
The paper presents what it describes as the first differentiable fault-analysis method for semantic steering in generative vision-language models, showing how low-level bit flips can alter high-level meaning.
Findings
Gradient-based sensitivity estimates identify semantically critical bits.
Bit flips can significantly change caption semantics without affecting grammar.
The method enables targeted semantic manipulation of model outputs.
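To make the idea of a gradient-based bit-sensitivity estimate concrete, here is a minimal sketch of one way to score the bits of a single weight. It is not BLADE itself: it assumes the weight is stored as an IEEE 754 float32, that a loss gradient for that weight is already available, and that a first-order approximation (gradient times the change in the weight) is a usable proxy for how much a flip would move the loss. The names `flip_bit` and `rank_bits` are hypothetical and chosen only for illustration.

```python
import math
import struct


def flip_bit(value: float, bit: int) -> float:
    """Flip one bit of `value` viewed as a float32 (0 = mantissa LSB, 31 = sign)."""
    (as_int,) = struct.unpack("<I", struct.pack("<f", value))
    (flipped,) = struct.unpack("<f", struct.pack("<I", as_int ^ (1 << bit)))
    return flipped


def rank_bits(weight: float, grad: float) -> list[tuple[int, float]]:
    """Rank bits by the first-order estimate |grad * (w_flipped - w)|.

    Assumption: a first-order Taylor term stands in for the true loss change.
    Flips that produce inf or NaN are skipped, since they break the model
    outright rather than shifting its output.
    """
    # Round the weight to float32 first so the deltas reflect the stored value.
    (w32,) = struct.unpack("<f", struct.pack("<f", weight))
    scores = []
    for bit in range(32):
        flipped = flip_bit(w32, bit)
        if not math.isfinite(flipped):
            continue
        scores.append((bit, abs(grad * (flipped - w32))))
    return sorted(scores, key=lambda s: s[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical weight and gradient values, used only to exercise the code.
    for bit, score in rank_bits(0.5, 0.1)[:3]:
        print(bit, score)
```

In this toy setting the high exponent bits dominate the ranking because flipping them changes the weight's magnitude by many orders of magnitude. A ranking aimed at semantic steering would presumably need a loss that captures caption meaning and grammaticality, which this sketch does not attempt to model.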
Abstract
Hard-to-detect hardware bit flips, from either malicious circuitry or bugs, have already been shown to make transformers vulnerable in non-generative tasks. This work, for the first time, investigates how low-level, bitwise perturbations (fault injection) to the weights of a large language model (LLM) used for image captioning can influence the semantic meaning of its generated descriptions while preserving grammatical structure. While prior fault analysis methods have shown that flipping a few bits can crash classifiers or degrade accuracy, these approaches overlook the semantic and linguistic dimensions of generative systems. In image captioning models, a single flipped bit might subtly alter how visual features map to words, shifting the entire narrative an AI tells about the world. We hypothesize that such semantic drifts are not random but differentiably estimable. That is, the…
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis
