Unveiling the Compositional Ability Gap in Vision-Language Reasoning Model
Tianle Li, Jihai Zhang, Yongming Rao, Yu Cheng

TL;DR
This paper investigates the compositional reasoning abilities of vision-language models (VLMs). It shows where current training strategies fall short and proposes methods to improve cross-modal generalization and compositional reasoning.
Contribution
The study systematically evaluates the compositional reasoning of VLMs and identifies key factors, such as visual grounding and captioning, that improve generalization across tasks and modalities.
Findings
RL-trained VLMs outperform supervised fine-tuned (SFT) models in compositional generalization
Current VLMs struggle with cross-modal and cross-task compositional generalization
Explicit visual description and grounding strategies improve VLMs' reasoning capabilities
Abstract
While large language models (LLMs) demonstrate strong reasoning capabilities utilizing reinforcement learning (RL) with verifiable reward, whether large vision-language models (VLMs) can directly inherit such capabilities through similar post-training strategies remains underexplored. In this work, we conduct a systematic compositional probing study to evaluate whether current VLMs trained with RL or other post-training strategies can compose capabilities across modalities or tasks under out-of-distribution conditions. We design a suite of diagnostic tasks that train models on unimodal tasks or isolated reasoning skills, and evaluate them on multimodal, compositional variants requiring skill integration. Through comparisons between supervised fine-tuning (SFT) and RL-trained models, we identify three key findings: (1) RL-trained models consistently outperform SFT on compositional…
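The abstract refers to RL with verifiable reward but does not describe the reward itself. The following is a minimal sketch of what such a reward might look like, not the authors' implementation. The `<answer>...</answer>` output convention, the function names, and the case-insensitive exact-match rule are all assumptions made for illustration.

```python
import re
from typing import Optional

# Assumption: the model wraps its final answer in <answer>...</answer> tags.
# The source does not specify an output format; this one is hypothetical.
ANSWER_PATTERN = re.compile(r"<answer>(.*?)</answer>", re.DOTALL | re.IGNORECASE)


def extract_answer(completion: str) -> Optional[str]:
    """Return the last tagged answer in the completion, or None if absent."""
    matches = ANSWER_PATTERN.findall(completion)
    if not matches:
        return None
    return matches[-1].strip()


def verifiable_reward(completion: str, ground_truth: str) -> float:
    """Binary reward: 1.0 if the extracted answer matches the reference, else 0.0.

    Matching is case-insensitive exact string comparison (an assumption);
    a completion with no tagged answer receives 0.0.
    """
    predicted = extract_answer(completion)
    if predicted is None:
        return 0.0
    return 1.0 if predicted.lower() == ground_truth.strip().lower() else 0.0


if __name__ == "__main__":
    sample = "The shape on the left has four sides. <answer>Square</answer>"
    print(verifiable_reward(sample, "square"))
```

Because the reward depends only on a checkable final answer, the same function could score a text-only training prompt and a multimodal evaluation prompt alike, which is the kind of setup the probing study compares.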
Taxonomy
Topics: Robotics and Automated Systems
Methods: Shrink and Fine-Tune
