GrAInS: Gradient-based Attribution for Inference-Time Steering of LLMs and VLMs
Duy Nguyen, Archiki Prasad, Elias Stengel-Eskin, Mohit Bansal

TL;DR
GrAInS introduces a gradient-based, inference-time steering method for LLMs and VLMs that enhances control over model outputs by identifying influential tokens and adjusting activations without retraining.
Contribution
It presents a novel, interpretable approach using Integrated Gradients for token attribution to steer models at inference time across multimodal tasks.
Findings
Achieves a 13.22% accuracy improvement on TruthfulQA with Llama-3.1-8B.
Reduces hallucination rates on MMHal-Bench from 0.624 to 0.514.
Improves alignment win rates on SPA-VL by 8.11%.
Abstract
Inference-time steering methods offer a lightweight alternative to fine-tuning large language models (LLMs) and vision-language models (VLMs) by modifying internal activations at test time without updating model weights. However, most existing approaches rely on fixed, global intervention vectors, overlook the causal influence of individual input tokens, and fail to leverage informative gradients from the model's logits, particularly in multimodal settings where visual and textual inputs contribute unevenly. To address these limitations, we introduce GrAInS, an inference-time steering approach that operates across both language-only and vision-language models and tasks. GrAInS uses contrastive, gradient-based attribution via Integrated Gradients to identify the top-k most influential tokens, both positively and negatively attributed based on their contribution to preferred versus…
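The attribution step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: a toy differentiable scorer stands in for the model, and all names (`contrastive_score`, `contrastive_grad`, `integrated_gradients`, `top_k_tokens`, `w_pref`, `w_disp`) are hypothetical. The sketch assumes the contrastive objective is the preferred-output logit minus the dispreferred-output logit, uses an all-zeros baseline, and approximates the Integrated Gradients path integral with a midpoint Riemann sum. How the selected tokens are turned into a steering vector is not shown, because the abstract is truncated before that point.

```python
import numpy as np

def contrastive_score(emb, w_pref, w_disp):
    """Toy stand-in for logit(preferred) - logit(dispreferred).

    emb has shape (n_tokens, d); a real model would replace this function.
    """
    pooled = np.tanh(emb).mean(axis=0)  # shape (d,)
    return float(pooled @ (w_pref - w_disp))

def contrastive_grad(emb, w_pref, w_disp):
    """Analytic gradient of contrastive_score w.r.t. emb, shape (n_tokens, d)."""
    n_tokens = emb.shape[0]
    return (1.0 - np.tanh(emb) ** 2) * (w_pref - w_disp) / n_tokens

def integrated_gradients(emb, baseline, w_pref, w_disp, steps=64):
    """Approximate IG along the straight path from baseline to emb."""
    alphas = (np.arange(steps) + 0.5) / steps  # midpoints of [0, 1]
    avg_grad = np.zeros_like(emb)
    for a in alphas:
        point = baseline + a * (emb - baseline)
        avg_grad += contrastive_grad(point, w_pref, w_disp)
    avg_grad /= steps
    return (emb - baseline) * avg_grad

def top_k_tokens(attr, k):
    """Sum attributions per token, then keep the k largest by magnitude."""
    token_scores = attr.sum(axis=1)
    order = np.argsort(-np.abs(token_scores))[:k]
    positive = [int(i) for i in order if token_scores[i] > 0]
    negative = [int(i) for i in order if token_scores[i] < 0]
    return token_scores, positive, negative

# Hypothetical inputs: 6 tokens with 4-dimensional embeddings.
rng = np.random.default_rng(0)
emb = rng.normal(size=(6, 4))
baseline = np.zeros_like(emb)
w_pref = rng.normal(size=4)
w_disp = rng.normal(size=4)

attr = integrated_gradients(emb, baseline, w_pref, w_disp)
token_scores, positive, negative = top_k_tokens(attr, k=3)
print("positively attributed tokens:", positive)
print("negatively attributed tokens:", negative)
```

A useful sanity check for any Integrated Gradients implementation is the completeness property: the attributions should sum (approximately) to the difference between the score at the input and the score at the baseline.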
Taxonomy
Topics: Elevator Systems and Control
