Generalizing from SIMPLE to HARD Visual Reasoning: Can We Mitigate Modality Imbalance in VLMs?

Simon Park; Abhishek Panigrahi; Yun Cheng; Dingli Yu; Anirudh Goyal; Sanjeev Arora

arXiv:2501.02669 · cs.CV · June 3, 2025

TL;DR

This paper introduces a synthetic framework to evaluate and improve multi-step visual reasoning in vision-language models, focusing on mitigating modality imbalance through training strategies like explicit image-to-text conversion and chain-of-thought reasoning.

Contribution

It proposes a systematic approach to assess and enhance simple-to-hard generalization in visual reasoning tasks, emphasizing training strategies that transfer reasoning from text to images.

Findings

1. Explicit image-to-text conversion improves reasoning transfer.
2. Chain-of-thought enhances generalization performance.
3. Gradient alignment measures can identify effective training strategies.
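
To make the gradient-alignment finding concrete, here is a minimal sketch of one common alignment measure: the cosine similarity between two gradient vectors. This summary does not say which measure the paper uses, so the choice of cosine similarity, the function name `cosine_alignment`, and the flat-list gradient representation are all assumptions made for illustration.

```python
import math


def cosine_alignment(grad_a, grad_b):
    """Cosine similarity between two flattened gradient vectors.

    Hypothetical helper: the paper's exact alignment measure is not
    given in this summary; cosine similarity is an assumed stand-in.
    """
    if len(grad_a) != len(grad_b):
        raise ValueError("gradients must have the same length")
    dot = sum(a * b for a, b in zip(grad_a, grad_b))
    norm_a = math.sqrt(sum(a * a for a in grad_a))
    norm_b = math.sqrt(sum(b * b for b in grad_b))
    # A zero gradient has no direction, so report no alignment.
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Under this reading, a value near 1 would mean that a step on one objective (say, the loss on SIMPLE examples) also moves the parameters in a direction that helps the other (the loss on HARD examples), while a value near 0 or below would mean the two objectives are unrelated or in conflict.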

Abstract

Vision Language Models (VLMs) are impressive at visual question answering and image captioning. But they underperform on multi-step visual reasoning -- even compared to LLMs on the same tasks presented in text form -- giving rise to perceptions of modality imbalance or brittleness. Towards a systematic study of such issues, we introduce a synthetic framework for assessing the ability of VLMs to perform algorithmic visual reasoning, comprising three tasks: Table Readout, Grid Navigation, and Visual Analogy. Each has two levels of difficulty, SIMPLE and HARD, and even the SIMPLE versions are difficult for frontier VLMs. We propose strategies for training on the SIMPLE version of tasks that improve performance on the corresponding HARD task, i.e., simple-to-hard (S2H) generalization. This controlled setup, where each task also has an equivalent text-only version, allows a quantification of…
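
As a rough illustration of what an "equivalent text-only version" of a task such as Grid Navigation might look like, the sketch below serializes a small grid as text and computes a reference answer with breadth-first search. Everything in it is hypothetical: the cell symbols (`S`, `G`, `#`, `.`), the move letters, and the function names are assumptions, and the paper's actual task encoding and SIMPLE/HARD definitions are not shown in this summary.

```python
from collections import deque

# Hypothetical encoding: S = start, G = goal, # = obstacle, . = free cell.
EXAMPLE_GRID = [list("S.#"), list(".##"), list("..G")]


def grid_to_text(grid):
    """Serialize the grid as plain text, one row per line."""
    return "\n".join("".join(row) for row in grid)


def find_cell(grid, symbol):
    """Return the (row, col) of the first cell holding `symbol`."""
    for r, row in enumerate(grid):
        for c, cell in enumerate(row):
            if cell == symbol:
                return (r, c)
    raise ValueError(f"symbol {symbol!r} not found")


def shortest_path(grid):
    """Breadth-first search from S to G; returns a move string or None."""
    rows, cols = len(grid), len(grid[0])
    start, goal = find_cell(grid, "S"), find_cell(grid, "G")
    moves = {"U": (-1, 0), "D": (1, 0), "L": (0, -1), "R": (0, 1)}
    queue = deque([(start, "")])
    seen = {start}
    while queue:
        (r, c), path = queue.popleft()
        if (r, c) == goal:
            return path
        for name, (dr, dc) in moves.items():
            nr, nc = r + dr, c + dc
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and grid[nr][nc] != "#" and (nr, nc) not in seen:
                seen.add((nr, nc))
                queue.append(((nr, nc), path + name))
    return None  # goal unreachable
```

In a setup like this, the same grid could be rendered as an image for the visual task and passed through `grid_to_text` for the text task, so that both modalities share one ground-truth answer from `shortest_path`.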

Code & Models

Repositories

princeton-pli/vlm_s2h (PyTorch, official)

Videos

Generalizing from SIMPLE to HARD Visual Reasoning: Can We Mitigate Modality Imbalance in VLMs? (SlidesLive)

Taxonomy

Topics: Elevator Systems and Control