Good Scores, Bad Data: A Metric for Multimodal Coherence
Vasundra Srinivasan

TL;DR
The paper introduces the Multimodal Coherence Score (MCS), a metric that evaluates the internal consistency of multimodal AI inputs independently of downstream task performance.
Contribution
It proposes a lightweight coherence metric that decomposes multimodal fusion quality into four dimensions (identity, spatial, semantic, and decision) and validates it across three fusion architectures and two datasets (Visual Genome and COCO).
Findings
MCS is more sensitive to coherence issues than task accuracy alone (Spearman rho = 0.093 vs. 0.071).
Each dimension of MCS responds independently to its own failure mode, with zero cross-talk in the perturbation experiments.
MCS requires no human annotation and can identify which aspect of the data is incoherent.
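The "zero cross-talk" finding can be illustrated with a small check. This is a minimal sketch, not the paper's code: the function name `cross_talk`, the tolerance, and all score values below are hypothetical and exist only to show what it means for a perturbation to move its own dimension and no other.

```python
# Hypothetical illustration: the scores below are made up, not taken from the paper.
DIMENSIONS = ("identity", "spatial", "semantic", "decision")


def cross_talk(baseline, perturbed_by_dim, tol=1e-9):
    """List every off-target change as (perturbed_dim, affected_dim, delta).

    baseline: dict mapping each dimension to its score on clean data.
    perturbed_by_dim: dict mapping a perturbed dimension to the scores
        measured after that perturbation was applied.
    """
    leaks = []
    for target, scores in perturbed_by_dim.items():
        for dim in DIMENSIONS:
            delta = scores[dim] - baseline[dim]
            # A change in any dimension other than the target is cross-talk.
            if dim != target and abs(delta) > tol:
                leaks.append((target, dim, delta))
    return leaks


baseline = {dim: 0.9 for dim in DIMENSIONS}

# Each perturbation lowers only its own dimension: no cross-talk.
clean_run = {dim: {**baseline, dim: 0.5} for dim in DIMENSIONS}

# A spatial perturbation that also moves the semantic score: cross-talk.
leaky_run = {"spatial": {**baseline, "spatial": 0.5, "semantic": 0.8}}

print(cross_talk(baseline, clean_run))
print(cross_talk(baseline, leaky_run))
```

An empty list corresponds to the behavior the paper reports; any entry names the perturbation that leaked and the dimension it leaked into.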
Abstract
Multimodal AI systems are evaluated by downstream task accuracy, but high accuracy does not mean the underlying data is coherent. A model can score well on Visual Question Answering (VQA) while its inputs contradict each other. We introduce the Multimodal Coherence Score (MCS), a metric that evaluates fusion quality independent of any downstream model. MCS decomposes coherence into four dimensions (identity, spatial, semantic, and decision), with weights learned via Nelder-Mead optimization. We evaluate on 1,000 Visual Genome images using DETR, CLIP, and ViLT, and validate on 150 COCO images with no retraining. Across three fusion architectures, MCS discriminates quality with higher sensitivity than task accuracy alone (Spearman rho = 0.093 vs. 0.071). Perturbation experiments confirm each dimension responds independently to its failure mode with zero cross-talk. MCS is lightweight,…
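To make the idea of a weighted, four-dimension score concrete, here is a minimal sketch of fitting the weights with Nelder-Mead. It rests on several assumptions the abstract does not confirm: that MCS is a convex combination of the per-dimension scores, that the objective is a mean squared error against a quality target, and that the data look like the synthetic arrays below. The names `normalize`, `mcs`, `loss`, and `true_weights` are hypothetical; only `scipy.optimize.minimize` with `method="Nelder-Mead"` is a real library call.

```python
import numpy as np
from scipy.optimize import minimize

DIMENSIONS = ("identity", "spatial", "semantic", "decision")


def normalize(raw):
    """Map unconstrained parameters to non-negative weights that sum to 1."""
    w = np.abs(raw)
    return w / w.sum()


def mcs(scores, weights):
    """Assumed form of the score: a weighted sum of per-dimension scores."""
    return scores @ weights


# Synthetic stand-in data (not from the paper): 200 samples, 4 dimension scores.
rng = np.random.default_rng(0)
scores = rng.uniform(0.0, 1.0, size=(200, len(DIMENSIONS)))
true_weights = np.array([0.4, 0.3, 0.2, 0.1])  # hypothetical ground truth
target = scores @ true_weights                  # hypothetical quality signal


def loss(raw):
    """Assumed objective: mean squared error between MCS and the target."""
    return float(np.mean((mcs(scores, normalize(raw)) - target) ** 2))


# Nelder-Mead is derivative-free, so the non-smooth abs() in normalize is fine.
result = minimize(
    loss,
    x0=np.ones(len(DIMENSIONS)),
    method="Nelder-Mead",
    options={"xatol": 1e-8, "fatol": 1e-12, "maxiter": 5000},
)
learned = normalize(result.x)
print(dict(zip(DIMENSIONS, learned.round(3))))
```

Normalizing inside the objective keeps the search unconstrained while still yielding weights that are non-negative and sum to one. A different objective, such as a rank correlation with quality labels, would slot into `loss` without changing the rest.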
