MaLoRA: Gated Modality LoRA for Key-Space Alignment in Multimodal LLM Fine-Tuning

Xinhan Zheng; Huyu Wu; Xueting Wang; Duo Su; Haiyun Jiang

arXiv:2510.26721·cs.AI·April 21, 2026

TL;DR

This paper argues that the text bias of multimodal large language models stems from an internal misalignment between visual and textual attention key spaces, not only from external factors such as data imbalance or instruction tuning.

Contribution

It introduces MaLoRA, a novel gated modality LoRA method that aligns visual and textual key spaces to improve multimodal reasoning.
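
The summary does not spell out the MaLoRA architecture, so the following is only one plausible reading of "gated modality LoRA" on the key projection, not the authors' implementation. It assumes a frozen key projection `W_k`, a low-rank update `B (A x)` that is applied only to visual tokens, and a scalar sigmoid gate. The names `gated_key`, `gate_logit`, and `is_visual`, the toy shapes, and the nonzero `B` (standard LoRA usually initializes `B` to zero) are all hypothetical choices made for illustration.

```python
import math


def matvec(M, x):
    """Multiply matrix M (list of rows) by vector x."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gated_key(x, is_visual, W_k, A, B, gate_logit):
    """Key projection with a gated low-rank update for visual tokens only.

    Assumed form (hypothetical, not taken from the paper):
        k = W_k x + [is_visual] * sigmoid(gate_logit) * B (A x)
    """
    base = matvec(W_k, x)
    if not is_visual:
        return base  # text tokens keep the frozen projection unchanged
    delta = matvec(B, matvec(A, x))  # low-rank correction, rank = len(A)
    g = sigmoid(gate_logit)          # gate in (0, 1) scales the correction
    return [b + g * d for b, d in zip(base, delta)]


# Hypothetical toy setup: hidden size 3, rank 1.
W_k = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # frozen weights
A = [[1.0, 1.0, 1.0]]        # rank x hidden (down-projection)
B = [[0.5], [0.0], [-0.5]]   # hidden x rank (up-projection)
x = [1.0, 2.0, 3.0]

text_key = gated_key(x, False, W_k, A, B, gate_logit=0.0)
visual_key = gated_key(x, True, W_k, A, B, gate_logit=0.0)
```

In this sketch the same hidden state yields a different key depending on its modality flag: the text path is untouched, while the visual path is shifted by a gated low-rank term that training could use to move visual keys toward the text key distribution.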

Findings

01 Visual keys are out-of-distribution relative to text keys.

02 Distributional analysis shows significant divergence between visual and text keys.

03 Aligning key spaces reduces modality bias in attention mechanisms.
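
The link between key-space misalignment and attention bias can be illustrated with a toy calculation. The sketch below assumes a single query that points along the text key directions, three text keys, and three visual keys that sit in a different region of the space. All vectors are invented for illustration. It measures the share of softmax attention that lands on the visual keys, then repeats the measurement after a simple mean-matching shift. Mean matching is a stand-in used here to show the effect of alignment; it is not the MaLoRA method.

```python
import math


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def visual_attention_mass(query, text_keys, visual_keys):
    """Fraction of scaled dot-product attention assigned to visual keys."""
    keys = text_keys + visual_keys
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, k) / scale for k in keys])
    return sum(weights[len(text_keys):])


def match_mean(source, target):
    """Shift `source` so its per-dimension mean equals that of `target`."""
    dim = len(source[0])
    mu_s = [sum(v[d] for v in source) / len(source) for d in range(dim)]
    mu_t = [sum(v[d] for v in target) / len(target) for d in range(dim)]
    return [[v[d] - mu_s[d] + mu_t[d] for d in range(dim)] for v in source]


# Invented toy data: text keys cluster near the query direction,
# visual keys occupy other dimensions (out-of-distribution for the query).
query = [1.0, 1.0, 0.0, 0.0]
text_keys = [[1.0, 0.8, 0.1, 0.0], [0.9, 1.1, 0.0, 0.1], [1.1, 0.9, -0.1, 0.0]]
visual_keys = [[-0.2, 0.1, 2.0, 1.9], [0.1, -0.1, 2.1, 2.0], [0.0, 0.2, 1.9, 2.2]]

before = visual_attention_mass(query, text_keys, visual_keys)
aligned = match_mean(visual_keys, text_keys)
after = visual_attention_mass(query, text_keys, aligned)

print(f"visual attention mass before alignment: {before:.3f}")
print(f"visual attention mass after alignment:  {after:.3f}")
```

With equal numbers of text and visual keys, an unbiased split would give each modality about half of the attention. In this toy setup the misaligned visual keys receive well under half, and the shifted keys recover a roughly even share.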

Abstract

Multimodal large language models (MLLMs) exhibit a pronounced preference for textual inputs when processing vision-language data, limiting their ability to reason effectively from visual evidence. Unlike prior studies that attribute this text bias to external factors such as data imbalance or instruction tuning, we propose that the bias originates from the model's internal architecture. Specifically, we hypothesize that visual key vectors (Visual Keys) are out-of-distribution (OOD) relative to the text key space learned during language-only pretraining. Consequently, these visual keys receive systematically lower similarity scores during attention computation, leading to their under-utilization in the context representation. To validate this hypothesis, we extract key vectors from LLaVA and Qwen2.5-VL and analyze their distributional structures using qualitative (t-SNE) and quantitative…
