HopChain: Multi-Hop Data Synthesis for Generalizable Vision-Language Reasoning

Shenzhi Wang; Shixuan Liu; Jing Zhou; Chang Gao; Xiong-Hui Chen; Binghai Wang; An Yang; Shiji Song; Bowen Yu; Gao Huang; and Junyang Lin

arXiv:2603.17024 · cs.CV · March 20, 2026

Open Access

TL;DR

HopChain is a scalable framework that synthesizes multi-hop vision-language reasoning data to enhance the reasoning capabilities of models across diverse benchmarks, especially in complex, multi-step tasks.

Contribution

We introduce HopChain, a novel method for generating multi-hop reasoning data that significantly improves vision-language models' performance on various benchmarks.

Findings

01. Improves 20 out of 24 benchmarks across multiple domains.

02. Enhances long-chain reasoning, exceeding 50 points in ultra-long-CoT tasks.

03. Replacing multi-hop data with simpler variants reduces performance.

Abstract

Vision-language models (VLMs) show strong multimodal capabilities but still struggle with fine-grained vision-language reasoning. We find that long chain-of-thought (CoT) reasoning exposes diverse failure modes, including perception, reasoning, knowledge, and hallucination errors, which can compound across intermediate steps. However, most existing vision-language data used for reinforcement learning with verifiable rewards (RLVR) does not involve complex reasoning chains that rely on visual evidence throughout, leaving these weaknesses largely unexposed. We therefore propose HopChain, a scalable framework for synthesizing multi-hop vision-language reasoning data for RLVR training of VLMs. Each synthesized multi-hop query forms a logically dependent chain of instance-grounded hops, where earlier hops establish the instances, sets, or conditions needed for later hops, while the final…

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques