Transformation Driven Visual Reasoning
Xin Hong, Yanyan Lan, Liang Pang, Jiafeng Guo, Xueqi Cheng

TL;DR
This paper introduces a transformation driven visual reasoning paradigm and dataset, emphasizing the importance of inferring dynamic changes between states, which challenges current static reasoning models.
Contribution
It proposes a new reasoning task and dataset focusing on transformations between states, extending beyond static concept understanding in visual reasoning.
Findings
Models excel on single-step transformations but struggle with multi-step and view-invariant transformations.
The dataset TRANCE enables evaluation of dynamic reasoning capabilities.
Current models are far from human-level performance on complex transformation tasks.
Abstract
This paper defines a new visual reasoning paradigm by introducing an important factor, i.e.~transformation. The motivation comes from the fact that most existing visual reasoning tasks, such as CLEVR in VQA, are solely defined to test how well the machine understands the concepts and relations within static settings, like one image. We argue that this kind of \textbf{state driven visual reasoning} approach has limitations in reflecting whether the machine has the ability to infer the dynamics between different states, which has been shown as important as state-level reasoning for human cognition in Piaget's theory. To tackle this problem, we propose a novel \textbf{transformation driven visual reasoning} task. Given both the initial and final states, the target is to infer the corresponding single-step or multi-step transformation, represented as a triplet (object, attribute, value) or…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMultimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
