SEAM: Semantically Equivalent Across Modalities Benchmark for Vision-Language Models

Zhenwei Tang; Difan Jiao; Blair Yang; Ashton Anderson

arXiv:2508.18179 · cs.AI · August 26, 2025

TL;DR

SEAM is a benchmark designed to evaluate vision-language models' reasoning consistency across modalities using semantically equivalent inputs, revealing modality imbalances and perception failures.

Contribution

Introduces SEAM, a novel benchmark with standardized, semantically equivalent inputs across modalities for rigorous assessment of VLM reasoning capabilities.

Findings

01. Vision-language models often perform better on language inputs than on semantically equivalent vision inputs.

02. Cross-modal agreement between vision and language is relatively low.

03. Perception failures are mainly due to tokenization and hallucinations.

Abstract

Evaluating whether vision-language models (VLMs) reason consistently across representations is challenging because modality comparisons are typically confounded by task differences and asymmetric information. We introduce SEAM, a benchmark that pairs semantically equivalent inputs across four domains that have existing standardized textual and visual notations. By employing distinct notation systems across modalities, in contrast to OCR-based image-text pairing, SEAM provides a rigorous comparative assessment of the textual-symbolic and visual-spatial reasoning capabilities of VLMs. Across 21 contemporary models, we observe systematic modality imbalance: vision frequently lags language in overall performance, despite the problems containing semantically equivalent information, and cross-modal agreement is relatively low. Our error analysis reveals two main drivers: textual perception…
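The two quantities the abstract refers to, per-modality performance and cross-modal agreement, can be illustrated with a minimal sketch. It assumes that agreement means the fraction of questions on which a model gives the same answer for the language and vision versions of a problem, whether or not that answer is correct. The function names and the toy answers below are hypothetical and are not taken from the SEAM code or data.

```python
def accuracy(predictions, gold):
    """Fraction of predictions that match the gold answers."""
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


def agreement_rate(language_preds, vision_preds):
    """Fraction of questions answered identically in both modalities.

    Assumed definition: correctness is ignored, only answer equality counts.
    """
    matches = sum(l == v for l, v in zip(language_preds, vision_preds))
    return matches / len(language_preds)


# Toy multiple-choice answers for four semantically equivalent question pairs
# (hypothetical values for illustration only).
gold = ["A", "C", "B", "D"]
language_answers = ["A", "C", "B", "A"]
vision_answers = ["A", "B", "B", "C"]

language_acc = accuracy(language_answers, gold)
vision_acc = accuracy(vision_answers, gold)
agreement = agreement_rate(language_answers, vision_answers)

print(f"language accuracy: {language_acc:.2f}")
print(f"vision accuracy:   {vision_acc:.2f}")
print(f"agreement rate:    {agreement:.2f}")
```

In this toy case the language answers are more accurate than the vision answers, and the two modalities agree on only half of the questions. This is the kind of gap the benchmark is designed to expose.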

Code & Models

Datasets

lilvjosephtang/SEAM-Benchmark
dataset · 306 downloads
