Generalizing from SIMPLE to HARD Visual Reasoning: Can We Mitigate Modality Imbalance in VLMs?
Simon Park, Abhishek Panigrahi, Yun Cheng, Dingli Yu, Anirudh Goyal, Sanjeev Arora

TL;DR
This paper introduces a synthetic framework to evaluate and improve multi-step visual reasoning in vision-language models, focusing on mitigating modality imbalance through training strategies like explicit image-to-text conversion and chain-of-thought reasoning.
Contribution
It proposes a systematic approach to assess and enhance simple-to-hard generalization in visual reasoning tasks, emphasizing training strategies that transfer reasoning from text to images.
Findings
Explicit image-to-text conversion improves reasoning transfer.
Chain-of-thought enhances generalization performance.
Gradient alignment measures can identify effective training strategies.
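As an illustration of the gradient-alignment finding, here is a minimal sketch that scores how closely two gradient vectors point in the same direction using cosine similarity. Using cosine similarity as the alignment measure is an assumption, as this summary does not state the paper's exact definition. The function name `cosine_alignment` and the toy gradient values are hypothetical.

```python
import math


def cosine_alignment(grad_a, grad_b):
    """Cosine similarity between two flattened gradient vectors.

    Assumption: alignment is measured as cosine similarity; the paper's
    own measure may differ.
    """
    if len(grad_a) != len(grad_b):
        raise ValueError("gradients must have the same length")
    dot = sum(a * b for a, b in zip(grad_a, grad_b))
    norm_a = math.sqrt(sum(a * a for a in grad_a))
    norm_b = math.sqrt(sum(b * b for b in grad_b))
    # A zero gradient has no direction, so report no alignment.
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical toy gradients, e.g. from a text-input loss and an
# image-input loss on the same task (values invented for illustration).
text_grad = [0.5, -1.0, 2.0]
image_grad = [0.4, -0.8, 1.5]
score = cosine_alignment(text_grad, image_grad)
print(f"alignment score: {score:.4f}")
```

A score near 1 means the two gradients push the parameters in nearly the same direction, a score near 0 means they are unrelated, and a negative score means they conflict.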
Abstract
Vision Language Models (VLMs) are impressive at visual question answering and image captioning. But they underperform on multi-step visual reasoning -- even compared to LLMs on the same tasks presented in text form -- giving rise to perceptions of modality imbalance or brittleness. Towards a systematic study of such issues, we introduce a synthetic framework for assessing the ability of VLMs to perform algorithmic visual reasoning, comprising three tasks: Table Readout, Grid Navigation, and Visual Analogy. Each has two levels of difficulty, SIMPLE and HARD, and even the SIMPLE versions are difficult for frontier VLMs. We propose strategies for training on the SIMPLE version of tasks that improve performance on the corresponding HARD task, i.e., simple-to-hard (S2H) generalization. This controlled setup, where each task also has an equivalent text-only version, allows a quantification of…
Code & Models
- 🤗 PrincetonPLI/Eagle-X2-Llama3-8B-VisualAnalogy-MixPlus-120k (model)
- 🤗 PrincetonPLI/Eagle-X2-Llama3-8B-VisualAnalogy-AlignMixPlus-120k (model, 2 downloads)
- 🤗 PrincetonPLI/Eagle-X2-Llama3-8B-TableReadout-MixPlus-240k (model, 2 downloads)
- 🤗 PrincetonPLI/Eagle-X2-Llama3-8B-GridNavigation-MixPlus-120k (model, 3 downloads)
- 🤗 PrincetonPLI/Eagle-X2-Llama3-8B-ConsecutiveTableReadout-Mix-160k (model)
- 🤗 PrincetonPLI/Eagle-X2-Llama3-8B-TableReadout-AlignMixPlus-240k (model, 2 downloads)
- 🤗 PrincetonPLI/Eagle-X2-Llama3-8B-GridNavigation-AlignMixPlus-120k (model, 1 download)
- 🤗 PrincetonPLI/Eagle-X2-Llama3-8B (model, 3 downloads)
