VisualTrans: A Benchmark for Real-World Visual Transformation Reasoning

Yuheng Ji; Yipu Wang; Yuyang Liu; Xiaoshuai Hao; Yue Liu; Yuting Zhao; Huaihai Lyu; Xiaolong Zheng

arXiv:2508.04043·cs.CV·August 7, 2025

TL;DR

VisualTrans is a comprehensive benchmark designed to evaluate real-world visual transformation reasoning in human-object interactions, highlighting current models' strengths and weaknesses in dynamic, multi-step reasoning tasks.

Contribution

The paper introduces VisualTrans, the first benchmark for real-world visual transformation reasoning (VTR). It offers diverse tasks, high-quality data, and a scalable data construction pipeline, addressing limitations of previous benchmarks.

Findings

01. State-of-the-art models perform well on static spatial tasks.
02. Models struggle with dynamic, multi-step reasoning scenarios.
03. Temporal modeling and causal reasoning are key challenges.

Abstract

Visual transformation reasoning (VTR) is a vital cognitive capability that empowers intelligent agents to understand dynamic scenes, model causal relationships, and predict future states, thereby guiding actions and laying the foundation for advanced intelligent systems. However, existing benchmarks suffer from a sim-to-real gap, limited task complexity, and incomplete reasoning coverage, which restricts their practical use in real-world scenarios. To address these limitations, we introduce VisualTrans, the first comprehensive benchmark specifically designed for VTR in real-world human-object interaction scenarios. VisualTrans encompasses 12 semantically diverse manipulation tasks and systematically evaluates three essential reasoning dimensions (spatial, procedural, and quantitative) through 6 well-defined subtask types. The benchmark features 472 high-quality question-answer pairs in…
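
As a rough illustration of how results on a benchmark organized by reasoning dimension might be summarized, the sketch below computes per-dimension accuracy over a handful of question-answer records. The record fields (`dimension`, `answer`, `prediction`), the sample values, and the case-insensitive exact-match scoring rule are all assumptions made for this example; they are not taken from the VisualTrans release, and the paper's actual scoring protocol may differ.

```python
from collections import defaultdict

# Hypothetical records: the field names and values are illustrative only
# and do not reflect the benchmark's actual schema or data.
records = [
    {"dimension": "spatial", "answer": "A", "prediction": "A"},
    {"dimension": "spatial", "answer": "B", "prediction": "b"},
    {"dimension": "procedural", "answer": "C", "prediction": "A"},
    {"dimension": "procedural", "answer": "D", "prediction": "D"},
    {"dimension": "quantitative", "answer": "2", "prediction": "3"},
]


def accuracy_by_dimension(rows):
    """Return {dimension: fraction of predictions matching the answer}.

    Matching is case-insensitive exact match after trimming whitespace,
    which is an assumed scoring rule for this sketch.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for row in rows:
        dim = row["dimension"]
        total[dim] += 1
        if row["prediction"].strip().lower() == row["answer"].strip().lower():
            correct[dim] += 1
    return {dim: correct[dim] / total[dim] for dim in total}


scores = accuracy_by_dimension(records)
for dim, acc in scores.items():
    print(f"{dim}: {acc:.2f}")
```

Reporting accuracy separately for each dimension, rather than as a single pooled number, is what makes a gap between static spatial tasks and dynamic multi-step tasks visible.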

Code & Models

Datasets

WangYipu2002/VisualTrans
dataset · 13 downloads
