SEAM: Semantically Equivalent Across Modalities Benchmark for Vision-Language Models
Zhenwei Tang, Difan Jiao, Blair Yang, Ashton Anderson

TL;DR
SEAM is a benchmark designed to evaluate vision-language models' reasoning consistency across modalities using semantically equivalent inputs, revealing modality imbalances and perception failures.
Contribution
Introduces SEAM, a novel benchmark with standardized, semantically equivalent inputs across modalities for rigorous assessment of VLM reasoning capabilities.
Findings
Vision-language models often perform better on language inputs than on semantically equivalent vision inputs.
Cross-modal agreement between vision and language is relatively low.
Perception failures are mainly due to tokenization and hallucinations.
Abstract
Evaluating whether vision-language models (VLMs) reason consistently across representations is challenging because modality comparisons are typically confounded by task differences and asymmetric information. We introduce SEAM, a benchmark that pairs semantically equivalent inputs across four domains that have existing standardized textual and visual notations. By employing distinct notation systems across modalities, in contrast to OCR-based image-text pairing, SEAM provides a rigorous comparative assessment of the textual-symbolic and visual-spatial reasoning capabilities of VLMs. Across 21 contemporary models, we observe systematic modality imbalance: vision frequently lags language in overall performance, despite the problems containing semantically equivalent information, and cross-modal agreement is relatively low. Our error analysis reveals two main drivers: textual perception…
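The two headline measurements in the abstract, per-modality performance and cross-modal agreement, can be illustrated with a minimal sketch. It assumes each benchmark item stores a gold answer plus the model's answer for the language input and for the semantically equivalent vision input, and it assumes agreement means the fraction of items where the two answers match, whether or not they are correct. The `PairedItem` type, the function names, and the toy data are hypothetical, and the paper's exact metric definitions may differ.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class PairedItem:
    """One question posed in two semantically equivalent forms (hypothetical schema)."""
    gold: str             # reference answer
    language_answer: str  # model answer given the textual notation
    vision_answer: str    # model answer given the visual notation


def accuracy(items: Sequence[PairedItem], modality: str) -> float:
    """Fraction of items answered correctly in one modality ('language' or 'vision')."""
    if not items:
        raise ValueError("items must be non-empty")
    if modality not in ("language", "vision"):
        raise ValueError("modality must be 'language' or 'vision'")
    attr = f"{modality}_answer"
    correct = sum(getattr(item, attr) == item.gold for item in items)
    return correct / len(items)


def agreement_rate(items: Sequence[PairedItem]) -> float:
    """Fraction of items where both modalities yield the same answer.

    Assumption: agreement ignores correctness, so two identical wrong
    answers still count as agreeing.
    """
    if not items:
        raise ValueError("items must be non-empty")
    same = sum(item.language_answer == item.vision_answer for item in items)
    return same / len(items)


# Toy data, invented purely for illustration.
items = [
    PairedItem(gold="A", language_answer="A", vision_answer="A"),
    PairedItem(gold="B", language_answer="B", vision_answer="C"),
    PairedItem(gold="C", language_answer="D", vision_answer="D"),
    PairedItem(gold="D", language_answer="D", vision_answer="A"),
]

language_acc = accuracy(items, "language")
vision_acc = accuracy(items, "vision")
agreement = agreement_rate(items)
```

Reporting agreement separately from accuracy matters because a model can score similarly in both modalities while answering different items correctly in each, which would indicate inconsistent reasoning across representations.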
