Contra4: Evaluating Contrastive Cross-Modal Reasoning in Audio, Video, Image, and 3D

Artemis Panagopoulou; Le Xue; Honglu Zhou; Silvio Savarese; Ran Xu; Caiming Xiong; Chris Callison-Burch; Mark Yatskar; Juan Carlos Niebles

arXiv:2506.01275·cs.AI·September 17, 2025

TL;DR

This paper introduces Contra4, a dataset for evaluating whether multimodal models can reason contrastively across image, audio, video, and 3D inputs. Results on the dataset reveal clear limitations in current models.

Contribution

The paper presents Contra4, a large-scale dataset for contrastive cross-modal reasoning, and provides an analysis of current models' performance and challenges in this task.

Findings

1. State-of-the-art models achieve only 56% accuracy overall.

2. Fine-tuning improves performance by 56% relative to the baseline.

3. Models struggle significantly in four-modality reasoning scenarios.
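The findings quote two different kinds of percentage: an absolute accuracy and a relative gain from fine-tuning. A minimal sketch of the distinction follows. The function names and the numbers in the usage note are hypothetical and chosen only for illustration; they are not results from the paper.

```python
def absolute_improvement(baseline: float, improved: float) -> float:
    """Difference in accuracy, in accuracy points (both inputs are fractions in [0, 1])."""
    return improved - baseline


def relative_improvement(baseline: float, improved: float) -> float:
    """Gain expressed as a fraction of the baseline accuracy."""
    if baseline <= 0:
        raise ValueError("baseline accuracy must be positive")
    return (improved - baseline) / baseline
```

With a hypothetical baseline accuracy of 0.25 and a hypothetical fine-tuned accuracy of 0.39, `relative_improvement(0.25, 0.39)` is about 0.56 (a 56% relative gain), while `absolute_improvement(0.25, 0.39)` is about 0.14 (14 accuracy points). A large relative gain can therefore coexist with a modest absolute accuracy.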

Abstract

Real-world decision-making often begins with identifying which modality contains the most relevant information for a given query. While recent multimodal models have made impressive progress in processing diverse inputs, it remains unclear whether they can reason contrastively across multiple modalities to select the one that best satisfies a natural language prompt. We argue this capability is foundational, especially in retrieval-augmented and decision-time contexts, where systems must evaluate multiple signals and identify which one conveys the relevant information. To evaluate this skill, we introduce Contra4, a dataset for contrastive cross-modal reasoning across four modalities: image, audio, video, and 3D. Each example presents a natural language question alongside multiple candidate modality instances, and the model must select the one that semantically aligns with the prompt.…
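The task format described in the abstract (a question, several candidate modality instances, one correct choice) can be sketched as a small data structure with an accuracy metric. This is an illustrative sketch only: the class names, field names, and the `always_first` baseline are assumptions made here, not the dataset's actual schema or the authors' evaluation code.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence, Tuple


@dataclass(frozen=True)
class Candidate:
    modality: str  # e.g. "image", "audio", "video", or "3d"
    uri: str       # hypothetical pointer to the media file


@dataclass(frozen=True)
class Example:
    question: str                       # natural language prompt
    candidates: Tuple[Candidate, ...]   # one instance per competing modality
    answer_index: int                   # position of the candidate that matches the question


# A predictor maps an example to the index of the candidate it selects.
Predictor = Callable[[Example], int]


def accuracy(examples: Sequence[Example], predict: Predictor) -> float:
    """Fraction of examples where the predictor picks the correct candidate."""
    if not examples:
        raise ValueError("need at least one example")
    correct = sum(1 for ex in examples if predict(ex) == ex.answer_index)
    return correct / len(examples)


def accuracy_by_candidate_count(
    examples: Sequence[Example], predict: Predictor
) -> Dict[int, float]:
    """Accuracy grouped by how many candidates each example offers."""
    buckets: Dict[int, List[Example]] = defaultdict(list)
    for ex in examples:
        buckets[len(ex.candidates)].append(ex)
    return {k: accuracy(v, predict) for k, v in sorted(buckets.items())}


def always_first(example: Example) -> int:
    """Trivial baseline (an assumption for illustration): always choose candidate 0."""
    return 0
```

Grouping by candidate count is one way to separate two-, three-, and four-modality settings, which is where a breakdown such as the four-modality result would come from. A real model would replace `always_first` with a function that encodes the question and each candidate and returns the index of the best match.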

Taxonomy

Topics: Music Technology and Sound Studies