Unveiling the Compositional Ability Gap in Vision-Language Reasoning Model

Tianle Li; Jihai Zhang; Yongming Rao; Yu Cheng

arXiv:2505.19406 · cs.AI · May 27, 2025

TL;DR

This paper investigates the compositional reasoning abilities of vision-language models (VLMs), showing where current training strategies fall short and proposing methods to improve cross-modal generalization and compositional reasoning.

Contribution

The study systematically evaluates the compositional reasoning of VLMs and identifies key factors, such as visual grounding and captioning, that improve their ability to generalize across tasks and modalities.

Findings

01. RL-trained VLMs outperform models trained with supervised fine-tuning (SFT) in compositional generalization.

02. Current VLMs struggle with cross-modal and cross-task compositional generalization.

03. Explicit visual description and grounding strategies improve the reasoning capabilities of VLMs.

Abstract

While large language models (LLMs) demonstrate strong reasoning capabilities utilizing reinforcement learning (RL) with verifiable reward, whether large vision-language models (VLMs) can directly inherit such capabilities through similar post-training strategies remains underexplored. In this work, we conduct a systematic compositional probing study to evaluate whether current VLMs trained with RL or other post-training strategies can compose capabilities across modalities or tasks under out-of-distribution conditions. We design a suite of diagnostic tasks that train models on unimodal tasks or isolated reasoning skills, and evaluate them on multimodal, compositional variants requiring skill integration. Through comparisons between supervised fine-tuning (SFT) and RL-trained models, we identify three key findings: (1) RL-trained models consistently outperform SFT on compositional…

Code & Models

Repositories

ltl3a87/compa
PyTorch · Official

Videos

Unveiling the Compositional Ability Gap in Vision-Language Reasoning Model · SlidesLive

Taxonomy

Topics: Robotics and Automated Systems

Methods: Shrink and Fine-Tune