Selective LoRA for Visual Tokens and Attention Heads
Tiange Luo, Lajanugen Logeswaran, Jaekyeom Kim, Justin Johnson, Honglak Lee

TL;DR
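Image-LoRA is a parameter-efficient fine-tuning method for vision-language models that applies the LoRA update only to visual tokens, and only on the value path of a selected subset of attention heads. This reduces trainable parameters and adapter-only training FLOPs while matching or closely approaching standard LoRA on visual localization benchmarks.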
Contribution
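It introduces a token-, head-, and value-selective LoRA recipe for vision-language models: the update is applied only to visual tokens, restricted to the value path, and limited to a compact subset of attention heads chosen by a one-pass influence estimate from a rank-1 visual-token-only probe.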
Findings
Matches or closely approaches standard LoRA on visual localization benchmarks with controlled text:image token ratios.
Reduces trainable parameters and adapter-only training FLOPs in image-token-heavy regimes.
Maintains pure-text performance on GSM8K and improves with a stronger information bottleneck.
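To make the parameter reduction concrete, here is a rough count for a single attention layer. This is only a sketch: the dimensions (d_model, d_head, rank), the number of selected heads, and the assumption that the standard LoRA baseline adapts all four attention projections are illustrative choices, not values taken from the paper.

```python
def lora_params(d_in, d_out, rank):
    """Trainable parameters of one LoRA pair: A is (d_in x rank), B is (rank x d_out)."""
    return rank * (d_in + d_out)

def standard_lora_params(d_model, rank, n_proj=4):
    """Assumed baseline: LoRA on the q, k, v, and o projections of one layer."""
    return n_proj * lora_params(d_model, d_model, rank)

def value_head_lora_params(d_model, d_head, n_selected_heads, rank):
    """Value-path-only LoRA whose output covers just the selected heads' slices."""
    return lora_params(d_model, n_selected_heads * d_head, rank)

# Hypothetical dimensions, chosen only for illustration.
d_model, d_head, rank = 4096, 128, 8
full = standard_lora_params(d_model, rank)
selective = value_head_lora_params(d_model, d_head, n_selected_heads=4, rank=rank)
ratio = selective / full
print(full, selective, ratio)
```

Under these assumed numbers the selective adapter has roughly 14% of the baseline's trainable parameters per layer. The actual savings depend on how many heads are selected and on the baseline configuration.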
Abstract
Low-rank adaptation (LoRA) is widely used for parameter-efficient fine-tuning, but its standard all-token, all-head design ignores the heterogeneous structure of vision-language model (VLM) inputs. We introduce Image-LoRA, a vision-oriented PEFT recipe that views LoRA as a token-level residual update and applies this update only to visual tokens. Image-LoRA further restricts adaptation to the value path of a compact subset of attention heads, selected using a one-pass influence estimate from a rank-1 visual-token-only probe. This token-, head-, and value-selective design reduces trainable parameters and adapter-only training FLOPs while leaving the pure-text forward pass of the frozen backbone unchanged when no visual tokens are present. Across visual localization benchmarks with controlled text:image token ratios, Image-LoRA matches or closely approaches standard LoRA, while…
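The token-, head-, and value-selective update described in the abstract can be illustrated with a minimal NumPy sketch of a single value projection. This is not the authors' implementation: the function name image_lora_value, the tensor shapes, the scale factor, and the head indices are all assumptions made for illustration, and the head-selection probe is not modelled (head_ids is simply given).

```python
import numpy as np

def image_lora_value(x, W_v, A, B, visual_mask, head_ids, d_head, scale=1.0):
    """Value projection with a LoRA residual on visual tokens and selected heads only.

    x:           (T, d_model) token hidden states
    W_v:         (d_model, n_heads * d_head) frozen value weight
    A, B:        (d_model, r) and (r, len(head_ids) * d_head) LoRA factors
    visual_mask: (T,) boolean, True where the token is a visual token
    head_ids:    non-empty list of head indices that receive the update
    """
    v = x @ W_v                                  # frozen base projection for all tokens
    rows = np.flatnonzero(visual_mask)           # indices of visual tokens
    cols = np.concatenate(
        [np.arange(h * d_head, (h + 1) * d_head) for h in head_ids]
    )                                            # value columns of the selected heads
    delta = (x[rows] @ A) @ B * scale            # low-rank residual, visual tokens only
    v[np.ix_(rows, cols)] += delta               # text tokens and other heads untouched
    return v

# Hypothetical toy setup: 6 tokens (3 visual, 3 text), 4 heads of width 4, rank 2.
rng = np.random.default_rng(0)
T, d_model, n_heads, d_head, r = 6, 16, 4, 4, 2
x = rng.normal(size=(T, d_model))
W_v = rng.normal(size=(d_model, n_heads * d_head))
head_ids = [1, 3]
A = rng.normal(size=(d_model, r))
B = rng.normal(size=(r, len(head_ids) * d_head))  # random here so the effect is visible
mask = np.array([True, True, True, False, False, False])

base = x @ W_v
out = image_lora_value(x, W_v, A, B, mask, head_ids, d_head)
text_only = image_lora_value(x, W_v, A, B, np.zeros(T, dtype=bool), head_ids, d_head)
```

In this sketch, rows for text tokens and columns for unselected heads are identical to the frozen projection, and when the mask contains no visual tokens the output equals the base projection exactly, which mirrors the abstract's statement that the pure-text forward pass is unchanged.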
