The Cost of Language: Centroid Erasure Exposes and Exploits Modal Competition in Multimodal Language Models
Akshay Paruchuri, Ishan Chatterjee, Henry Fuchs, Ehsan Adeli, Piotr Didyk

TL;DR
This paper investigates modal competition in multimodal language models, revealing language dominance over vision, and introduces centroid replacement as a diagnostic and corrective tool to improve visual reasoning accuracy.
Contribution
It uncovers the structural imbalance favoring language in multimodal models and proposes centroid replacement and contrastive decoding to diagnose and mitigate this issue.
Findings
Across seven models spanning three architecture families, erasing text centroid structure reduces accuracy about 4 times more than erasing visual centroid structure.
Contrastive decoding against a text-centroid-erased reference improves accuracy by up to 16.9% on individual tasks.
Standard fine-tuned models benefit more from the intervention (+5.6% on average) than preference-optimized models.
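To make the contrastive decoding finding concrete, here is a minimal sketch of one decoding step. It is an illustration under stated assumptions, not the authors' implementation: the scoring rule (amplify the full model's log-probabilities, subtract the reference's) and the plausibility cutoff follow the common contrastive decoding recipe, and the names `contrastive_scores`, `alpha`, and `beta` are hypothetical. `ref_logits` stands in for the logits from a forward pass in which the text tokens have been collapsed to their centroids.

```python
import numpy as np

def log_softmax(logits):
    # Numerically stable log-softmax over a 1-D logit vector.
    z = logits - logits.max()
    return z - np.log(np.exp(z).sum())

def contrastive_scores(full_logits, ref_logits, alpha=1.0, beta=0.1):
    """Score next-token candidates by contrasting two forward passes.

    full_logits: logits from the unmodified input.
    ref_logits:  logits from the text-centroid-erased reference (assumed).
    alpha, beta: hypothetical hyperparameters, not taken from the paper.
    """
    lp_full = log_softmax(full_logits)
    lp_ref = log_softmax(ref_logits)
    # Boost what the full model prefers relative to the erased reference.
    scores = (1.0 + alpha) * lp_full - alpha * lp_ref
    # Plausibility cutoff: drop tokens the full model finds very unlikely.
    keep = lp_full >= np.log(beta) + lp_full.max()
    return np.where(keep, scores, -np.inf)

# Toy vocabulary of three tokens.
full_logits = np.array([2.0, 1.9, -3.0])
ref_logits = np.array([2.0, 0.0, 1.0])
scores = contrastive_scores(full_logits, ref_logits)
next_token = int(np.argmax(scores))
```

In this toy case the full model slightly prefers token 0, but the erased reference prefers it just as strongly, so the contrast shifts the choice to token 1; token 2 is removed by the plausibility cutoff.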
Abstract
Multimodal language models systematically underperform on visual perception tasks, yet the structure underlying this failure remains poorly understood. We propose centroid replacement, collapsing each token to its nearest K-means centroid, as a controlled probe for modal dependence. Across seven models spanning three architecture families, erasing text centroid structure costs 4× more accuracy than erasing visual centroid structure, exposing a universal imbalance where language representations overshadow vision even on tasks that demand visual reasoning. We exploit this asymmetry through text centroid contrastive decoding, recovering up to +16.9% accuracy on individual tasks by contrastively decoding against a text-centroid-erased reference. This intervention varies meaningfully with training approaches: standard fine-tuned models show larger gains (+5.6% on average) than…
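The centroid replacement probe described in the abstract can be sketched in a few lines. This is a hedged illustration, not the paper's code: the abstract does not say which representations are clustered, how K is chosen, or which data the centroids are fit on, so the sketch assumes token vectors arrive as a NumPy array of shape (n_tokens, dim) and fits a small K-means on those same vectors. The helper names `kmeans` and `centroid_replace` are hypothetical.

```python
import numpy as np

def kmeans(x, k, iters=20, seed=0):
    # Plain Lloyd's algorithm; a stand-in for any K-means implementation.
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(iters):
        # Squared Euclidean distance from every token to every centroid.
        dists = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = x[labels == j]
            if len(members):  # leave an empty cluster's centroid unchanged
                centroids[j] = members.mean(axis=0)
    return centroids

def centroid_replace(tokens, centroids):
    # Collapse each token vector to its nearest centroid.
    dists = ((tokens[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return centroids[dists.argmin(axis=1)]

# Toy example: 50 token vectors of dimension 8, collapsed to K = 4 centroids.
rng = np.random.default_rng(1)
tokens = rng.normal(size=(50, 8))
centroids = kmeans(tokens, k=4)
erased = centroid_replace(tokens, centroids)
```

After replacement, the sequence keeps its length and dimensionality but contains at most K distinct vectors, which is what removes the within-cluster structure. Applying this only to text tokens or only to visual tokens gives the two erasure conditions the abstract compares.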
