CRIT: Graph-Based Automatic Data Synthesis to Enhance Cross-Modal Multi-Hop Reasoning

Junyoung Sung; Seungwoo Lyu; Minjun Kim; Sumin An; Arsha Nagrani; Paul Hongsuck Seo

arXiv:2604.01634 · cs.LG · April 3, 2026

TL;DR

CRIT introduces a graph-based automatic pipeline to generate complex cross-modal reasoning tasks, addressing limitations in existing benchmarks and improving model performance on multi-hop reasoning across diverse modalities.

Contribution

A novel dataset and benchmark, CRIT, with an automatic graph-based pipeline for creating challenging cross-modal reasoning tasks across multiple domains.

Findings

01  Models trained on CRIT perform better on multi-hop reasoning benchmarks.

02  State-of-the-art models struggle with complex cross-modal reasoning tasks.

03  Training on CRIT also improves performance on SPIQA and other multimodal benchmarks.

Abstract

Real-world reasoning often requires combining information across modalities, connecting textual context with visual cues in a multi-hop process. Yet most multimodal benchmarks fail to capture this ability: they typically rely on single images or sets of images, where answers can be inferred from a single modality alone. This limitation is mirrored in the training data, where interleaved image-text content rarely enforces complementary, multi-hop reasoning. As a result, Vision-Language Models (VLMs) frequently hallucinate and produce reasoning traces that are poorly grounded in visual evidence. To address this gap, we introduce CRIT, a new dataset and benchmark built with a graph-based automatic pipeline for generating complex cross-modal reasoning tasks. CRIT covers diverse domains, ranging from natural images and videos to text-rich sources, and includes a manually verified test set for…
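
The abstract does not spell out how the graph-based pipeline works, so the following is only a toy sketch of the general idea, not the authors' method. It assumes a small hand-written graph whose edges are facts tagged with the modality that states them, walks multi-hop paths through it, and keeps only paths that need more than one modality. All names here (`Edge`, `EDGES`, `find_paths`, `synthesize`) and the example facts are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    src: str
    relation: str
    dst: str
    modality: str  # "image" or "text": where this fact can be observed


# Hypothetical toy graph; a real pipeline would extract this from data.
EDGES = [
    Edge("the woman", "is holding", "a red umbrella", "image"),
    Edge("a red umbrella", "was bought at", "the harbor market", "text"),
    Edge("the harbor market", "is located in", "Lisbon", "text"),
    Edge("the woman", "is standing next to", "a bicycle", "image"),
]


def find_paths(edges, start, hops):
    """Enumerate all cycle-free paths of exactly `hops` edges from `start`."""
    out_edges = {}
    for e in edges:
        out_edges.setdefault(e.src, []).append(e)

    paths = []

    def walk(node, path, seen):
        if len(path) == hops:
            paths.append(list(path))
            return
        for e in out_edges.get(node, []):
            if e.dst not in seen:
                path.append(e)
                walk(e.dst, path, seen | {e.dst})
                path.pop()

    walk(start, [], {start})
    return paths


def to_question(path):
    """Nest the relations so intermediate entities are never named."""
    desc = path[0].src
    for e in path[:-1]:
        desc = f"the entity that {desc} {e.relation}"
    return f"{desc} {path[-1].relation} what?"


def synthesize(edges, start, hops):
    """Keep only paths whose facts span more than one modality."""
    samples = []
    for path in find_paths(edges, start, hops):
        modalities = sorted({e.modality for e in path})
        if len(modalities) < 2:
            continue  # answerable from a single modality, so skip it
        samples.append({
            "question": to_question(path),
            "answer": path[-1].dst,
            "hops": len(path),
            "modalities": modalities,
        })
    return samples


if __name__ == "__main__":
    for sample in synthesize(EDGES, "the woman", hops=3):
        print(sample)
```

In this sketch the modality filter is what makes a sample cross-modal: a one-hop path from "the woman" touches only image facts and is discarded, while the three-hop path ending at "Lisbon" needs the image for the first hop and the text for the rest.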
