MaLoRA: Gated Modality LoRA for Key-Space Alignment in Multimodal LLM Fine-Tuning
Xinhan Zheng, Huyu Wu, Xueting Wang, Duo Su, Haiyun Jiang

TL;DR
This paper argues that the text bias of multimodal large language models stems from an internal misalignment between visual and textual attention key spaces, not only from external factors such as data imbalance or instruction tuning.
Contribution
It introduces MaLoRA, a novel gated modality LoRA method that aligns visual and textual key spaces to improve multimodal reasoning.
Findings
Visual keys are out-of-distribution relative to text keys.
Distributional analysis shows significant inter-modal divergence.
Aligning key spaces reduces modality bias in attention mechanisms.
Abstract
Multimodal large language models (MLLMs) exhibit a pronounced preference for textual inputs when processing vision-language data, limiting their ability to reason effectively from visual evidence. Unlike prior studies that attribute this text bias to external factors such as data imbalance or instruction tuning, we propose that the bias originates from the model's internal architecture. Specifically, we hypothesize that visual key vectors (Visual Keys) are out-of-distribution (OOD) relative to the text key space learned during language-only pretraining. Consequently, these visual keys receive systematically lower similarity scores during attention computation, leading to their under-utilization in the context representation. To validate this hypothesis, we extract key vectors from LLaVA and Qwen2.5-VL and analyze their distributional structures using qualitative (t-SNE) and quantitative…
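The mechanism the abstract hypothesizes can be illustrated with a toy example. The sketch below is not the authors' code or analysis: the dimensions, Gaussian key distributions, noise level, and the mean-matching step are all assumptions chosen for illustration, and mean matching is only a crude stand-in for key-space alignment, not MaLoRA itself. It shows that when a query matches the text key distribution, keys drawn from a shifted "visual" distribution receive little softmax attention mass, and that moving those keys into the text key distribution restores their share.

```python
import math
import random


def attention_weights(query, keys):
    """Scaled dot-product attention weights of one query over a list of keys."""
    scale = math.sqrt(len(query))
    logits = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_keys(mean, n, sigma, rng):
    """Draw n key vectors from an isotropic Gaussian around `mean` (toy assumption)."""
    return [[m + rng.gauss(0.0, sigma) for m in mean] for _ in range(n)]


def column_means(vectors):
    """Per-dimension mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def visual_mass(query, text_keys, visual_keys):
    """Total attention weight that lands on the visual keys."""
    weights = attention_weights(query, text_keys + visual_keys)
    return sum(weights[len(text_keys):])


rng = random.Random(0)
dim = 8  # hypothetical head dimension

# Hypothetical setup: text keys cluster around one mean, visual keys around
# a mean that is orthogonal to it, i.e. out-of-distribution for the query.
text_mean = [1.0] * dim
visual_mean = [1.0 if i % 2 == 0 else -1.0 for i in range(dim)]
text_keys = sample_keys(text_mean, 16, 0.3, rng)
visual_keys = sample_keys(visual_mean, 16, 0.3, rng)

# A query shaped by language-only pretraining is assumed to match text keys.
query = text_mean

mass_ood = visual_mass(query, text_keys, visual_keys)

# Crude stand-in for alignment: shift visual keys so their empirical mean
# matches the empirical mean of the text keys.
t_mu = column_means(text_keys)
v_mu = column_means(visual_keys)
aligned_keys = [[k - v + t for k, v, t in zip(key, v_mu, t_mu)]
                for key in visual_keys]
mass_aligned = visual_mass(query, text_keys, aligned_keys)

print(f"visual attention mass, OOD keys:     {mass_ood:.3f}")
print(f"visual attention mass, aligned keys: {mass_aligned:.3f}")
```

In this toy setting the visual keys make up half of the sequence, so an unbiased attention head would give them roughly half of the attention mass. The gap between the two printed values is the kind of under-utilization the hypothesis describes.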
