IsoBench: Benchmarking Multimodal Foundation Models on Isomorphic Representations
Deqing Fu, Ruohao Guo, Ghazal Khalighinejad, Ollie Liu, Bhuwan Dhingra, Dani Yogatama, Robin Jia, Willie Neiswanger

TL;DR
IsoBench is a benchmark dataset that evaluates foundation models across different input modalities using isomorphic representations, revealing modality preferences and proposing techniques to improve multimodal performance.
Contribution
The paper introduces IsoBench, a novel benchmark with isomorphic input representations, and proposes prompting methods to enhance multimodal model capabilities.
Findings
Models prefer textual over visual representations.
Claude-3 Opus performs 28.7 points worse when given images instead of text.
Proposed techniques improve performance by leveraging multiple representations.
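A gap such as the one reported above is a difference in accuracy, measured in percentage points, between two representations of the same problems. The following is a minimal sketch of that arithmetic, assuming per-example results are available as (modality, correct) pairs. The function names and the toy records are hypothetical and are not taken from the IsoBench code or its results.

```python
from collections import defaultdict


def accuracy_by_modality(records):
    """Return accuracy (in percent) per modality.

    `records` is an iterable of (modality, is_correct) pairs; this layout
    is an assumption made for illustration only.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for modality, is_correct in records:
        totals[modality] += 1
        correct[modality] += int(is_correct)
    return {m: 100.0 * correct[m] / totals[m] for m in totals}


def modality_gap(records, reference="text", other="image"):
    """Percentage-point gap: accuracy on `reference` minus accuracy on `other`."""
    acc = accuracy_by_modality(records)
    return acc[reference] - acc[other]


# Toy data, invented for this example (not benchmark results).
toy_records = [
    ("text", True), ("text", True), ("text", True), ("text", False),
    ("image", True), ("image", False), ("image", False), ("image", False),
]
gap = modality_gap(toy_records)
print(f"text minus image gap: {gap:.1f} points")
```

A positive gap means the model did better on the textual representation; because the underlying problems are the same, the difference can be attributed to the input form rather than to problem difficulty.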
Abstract
Current foundation models exhibit impressive capabilities when prompted either with text only or with both image and text inputs. But do their capabilities change depending on the input modality? In this work, we propose IsoBench, a benchmark dataset containing problems from four major areas: math, science, algorithms, and games. Each example is presented with multiple isomorphic representations of inputs, such as visual, textual, and mathematical presentations. IsoBench provides fine-grained feedback to diagnose performance gaps caused by the form of the representation. Across various foundation models, we observe that on the same problem, models have a consistent preference towards textual representations. Most prominently, when evaluated on all IsoBench problems, Claude-3 Opus performs 28.7 points worse when provided with images instead of text; similarly, GPT-4…
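To make "isomorphic representations" concrete, here is a small sketch that renders one graph in two textual forms carrying the same information: an adjacency matrix and a natural-language edge list. This is an illustrative toy under assumed formats, not the benchmark's actual data schema or prompts. All function names are hypothetical, and an image rendering of the same graph would be a third representation that is omitted here.

```python
def to_matrix_text(adj):
    """Render an adjacency matrix as rows of space-separated 0/1 entries."""
    return "\n".join(" ".join(str(int(v)) for v in row) for row in adj)


def from_matrix_text(text):
    """Parse the matrix text back into a list of lists of ints."""
    return [[int(tok) for tok in line.split()] for line in text.splitlines()]


def to_edge_list_text(adj):
    """Describe the same undirected graph as natural-language sentences."""
    n = len(adj)
    edges = [(i, j) for i in range(n) for j in range(i + 1, n) if adj[i][j]]
    lines = [f"Undirected graph with {n} nodes."]
    lines += [f"Node {i} is connected to node {j}." for i, j in edges]
    return "\n".join(lines)


# A 3-node path graph: 0 - 1 - 2 (toy example).
adj = [
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
]

print(to_matrix_text(adj))
print(to_edge_list_text(adj))
```

Both outputs describe the same graph, so a question such as "are nodes 0 and 2 connected?" has the same answer under either form; any difference in model accuracy between the two would then reflect the representation rather than the problem.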
Taxonomy
Topics: Natural Language Processing Techniques · Speech and dialogue systems · Semantic Web and Ontologies
Methods: Attention Is All You Need · Linear Layer · Layer Normalization · Multi-Head Attention · Adam · Byte Pair Encoding · Absolute Position Encodings · Softmax · Dense Connections · Label Smoothing
