The Cost of Language: Centroid Erasure Exposes and Exploits Modal Competition in Multimodal Language Models

Akshay Paruchuri; Ishan Chatterjee; Henry Fuchs; Ehsan Adeli; Piotr Didyk

arXiv:2604.14363 · cs.CL · April 17, 2026


TL;DR

This paper probes modal competition in multimodal language models and finds that language representations dominate visual ones. It introduces centroid replacement as a diagnostic probe and uses contrastive decoding against a text-centroid-erased reference as a corrective that improves accuracy on visual reasoning tasks.

Contribution

The paper exposes a structural imbalance favoring language in multimodal models, using centroid replacement to diagnose it and text centroid contrastive decoding to mitigate it.

Findings

01

Across seven models spanning three architecture families, erasing text centroid structure costs 4 times more accuracy than erasing visual centroid structure.

02

Contrastive decoding against a text-centroid-erased reference recovers up to +16.9% accuracy on individual tasks.

03

Standard fine-tuned models gain more from the intervention (+5.6% on average) than preference-optimized models.
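The contrastive decoding step in finding 02 can be illustrated with a small sketch. The listing does not give the paper's exact scoring rule, so the form below, (1 + alpha) * log p_full - alpha * log p_ref, is an assumption borrowed from common contrastive decoding formulations, and the names `contrastive_scores`, `full_logits`, `erased_logits`, and `alpha` are hypothetical. The toy logits are made up for illustration and are not results from the paper.

```python
import numpy as np

def log_softmax(z):
    """Numerically stable log-softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def contrastive_scores(full_logits, erased_logits, alpha=1.0):
    """Score tokens by contrasting the full model against a reference pass.

    ASSUMPTION: this (1 + alpha) * full - alpha * reference form is a common
    contrastive decoding rule, not necessarily the one used in the paper.
    `erased_logits` stands in for logits from a forward pass whose text
    tokens were collapsed to their nearest centroids.
    """
    full = log_softmax(np.asarray(full_logits, dtype=float))
    ref = log_softmax(np.asarray(erased_logits, dtype=float))
    return (1.0 + alpha) * full - alpha * ref

# Toy next-token logits over a 3-token vocabulary (illustrative only).
full_logits = np.array([2.0, 1.9, 0.1])
erased_logits = np.array([2.0, 0.5, 0.1])

greedy_choice = int(full_logits.argmax())
contrastive_choice = int(contrastive_scores(full_logits, erased_logits).argmax())
print(greedy_choice, contrastive_choice)
```

In this toy case the contrast moves the choice away from the token that the reference pass also favors and toward the token whose probability changes most between the two passes. With `alpha=0` the scores reduce to the full model's log-probabilities, so plain greedy decoding is recovered.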

Abstract

Multimodal language models systematically underperform on visual perception tasks, yet the structure underlying this failure remains poorly understood. We propose centroid replacement, collapsing each token to its nearest K-means centroid, as a controlled probe for modal dependence. Across seven models spanning three architecture families, erasing text centroid structure costs $4\times$ more accuracy than erasing visual centroid structure, exposing a universal imbalance where language representations overshadow vision even on tasks that demand visual reasoning. We exploit this asymmetry through text centroid contrastive decoding, recovering up to +16.9% accuracy on individual tasks by contrastively decoding against a text-centroid-erased reference. This intervention varies meaningfully with training approaches: standard fine-tuned models show larger gains (+5.6% on average) than…
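Centroid replacement, as the abstract describes it, collapses each token to its nearest K-means centroid. A minimal NumPy sketch of that operation follows. It assumes tokens are rows of an embedding matrix and uses plain Lloyd's iterations; the function names, the choice of `k`, the iteration count, and the random toy data are all hypothetical, and the abstract does not say at which layer or with how many centroids the paper applies the probe.

```python
import numpy as np

def fit_centroids(x, k, iters=20, seed=0):
    """Plain Lloyd's K-means on rows of x; returns a (k, d) centroid array."""
    rng = np.random.default_rng(seed)
    # Initialise centroids from k distinct data points.
    centroids = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(iters):
        # Squared Euclidean distance from every point to every centroid: (n, k).
        d2 = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        assign = d2.argmin(axis=1)
        for j in range(k):
            members = x[assign == j]
            if len(members):  # leave an empty cluster's centroid unchanged
                centroids[j] = members.mean(axis=0)
    return centroids

def centroid_replace(tokens, centroids):
    """Collapse each token embedding to its nearest centroid."""
    d2 = ((tokens[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return centroids[d2.argmin(axis=1)]

# Toy stand-in for one modality's token embeddings (illustrative only).
rng = np.random.default_rng(0)
tokens = rng.normal(size=(200, 16))

centroids = fit_centroids(tokens, k=8)
erased = centroid_replace(tokens, centroids)
print(erased.shape, len(np.unique(erased, axis=0)))
```

After replacement the sequence keeps its length and dimensionality but contains at most `k` distinct vectors, so any information carried by within-cluster variation is removed. Applying the same operation to only the text tokens or only the visual tokens is what would let the two modalities be compared.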
