TL;DR
This paper investigates the modality preference of Omni-modal Large Language Models, revealing a visual bias and providing insights into its emergence and implications for trustworthiness.
Contribution
It introduces a benchmark and metric for quantifying modality preference, and offers a mechanistic understanding and diagnostic tools for OLLMs.
Findings
Most OLLMs exhibit a pronounced visual preference, in contrast to the text dominance of traditional VLMs.
Modality preference emerges progressively in mid-to-late layers of the models.
The internal signals found by layer-wise probing can be used to diagnose cross-modal hallucinations.
Abstract
Native Omni-modal Large Language Models (OLLMs) have shifted from pipeline architectures to unified representation spaces. However, this native integration gives rise to a critical yet underexplored phenomenon: modality preference. To bridge this gap, we first systematically quantify modality preference of OLLMs using a newly-curated conflict-based benchmark and the modality selection rate metric. Our evaluation of ten representative OLLMs reveals a notable paradigm shift: unlike the "text-dominance" of traditional VLMs, most OLLMs exhibit a pronounced visual preference. To further understand the underlying mechanism, we conduct layer-wise probing and demonstrate that such modality preference is not static but emerges progressively in the mid-to-late layers. Building upon these insights, we leverage these internal signals to diagnose cross-modal hallucinations, achieving competitive…
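Illustrative sketch
The abstract names a modality selection rate metric computed on a conflict-based benchmark but does not give its formula here. One plausible reading, sketched below purely as an assumption, is the fraction of conflict samples on which the model's answer matches the answer supported by a given modality. The record layout (`answers`, `prediction`), the function name `modality_selection_rate`, and the `other` bucket are hypothetical and not taken from the paper.

```python
from collections import Counter


def modality_selection_rate(records):
    """Fraction of conflict samples whose prediction follows each modality.

    Assumed (hypothetical) record layout:
      {"answers": {modality_name: answer_supported_by_that_modality, ...},
       "prediction": model_answer}
    Predictions matching no modality, or more than one, are counted as "other".
    """
    counts = Counter()
    modalities = set()
    total = 0
    for rec in records:
        total += 1
        modalities.update(rec["answers"])
        pred = rec["prediction"].strip().lower()
        matched = [m for m, ans in rec["answers"].items()
                   if ans.strip().lower() == pred]
        # In a conflict sample each modality supports a different answer,
        # so an unambiguous prediction matches exactly one modality.
        counts[matched[0] if len(matched) == 1 else "other"] += 1
    if total == 0:
        return {}
    rates = {m: counts[m] / total for m in sorted(modalities)}
    rates["other"] = counts["other"] / total
    return rates


# Toy data, invented for illustration only (not benchmark results).
samples = [
    {"answers": {"visual": "dog", "audio": "cat", "text": "bird"}, "prediction": "dog"},
    {"answers": {"visual": "car", "audio": "train", "text": "plane"}, "prediction": "Car"},
    {"answers": {"visual": "rain", "audio": "wind", "text": "snow"}, "prediction": "wind"},
    {"answers": {"visual": "piano", "audio": "drum", "text": "flute"}, "prediction": "guitar"},
]
rates = modality_selection_rate(samples)
print(rates)
```

Under this reading, a visually biased model would show a visual rate well above the audio and text rates on the same conflict samples. The paper's actual metric may differ, for example in how refusals or ambiguous answers are handled.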
