RVTBench: A Benchmark for Visual Reasoning Tasks
Yiqing Shen, Chenjia Li, Chenxiao Fan, Mathias Unberath

TL;DR
RVTBench is a comprehensive benchmark for visual reasoning tasks that uses digital twin representations to better evaluate complex multi-step reasoning in videos, supporting various output formats and difficulty levels.
Contribution
The paper introduces RVTBench, a benchmark built with an automated construction pipeline that uses digital twin representations, and proposes RVTagent, an agent framework for zero-shot visual reasoning.
Findings
RVTBench contains 3,896 queries across multiple reasoning types.
The benchmark covers semantic, spatial, and temporal reasoning categories.
RVTagent achieves zero-shot generalization across different RVT tasks.
Abstract
Visual reasoning, the capability to interpret visual input in response to an implicit text query through multi-step reasoning, remains a challenge for deep learning models due to the lack of relevant benchmarks. Previous work in visual reasoning has primarily focused on reasoning segmentation, where models aim to segment objects based on implicit text queries. This paper introduces reasoning visual tasks (RVTs), a unified formulation that extends beyond traditional video reasoning segmentation to a diverse family of visual language reasoning problems and can therefore accommodate multiple output formats, including bounding boxes, natural language descriptions, and question-answer pairs. Correspondingly, we identify the limitations in current benchmark construction methods that rely solely on large language models (LLMs), which inadequately capture complex spatial-temporal relationships…
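To make the unified formulation concrete, here is a minimal Python sketch of how a single RVT query might be represented so that one record type covers several output formats. It is an illustration only, not the benchmark's actual schema: the names `RVTQuery`, `ReasoningCategory`, `OutputFormat`, `group_by_format`, and all field names are hypothetical, and the example queries are invented. The three reasoning categories and the output formats are the ones named in this summary.

```python
from dataclasses import dataclass
from enum import Enum


class ReasoningCategory(Enum):
    # Categories listed under Findings.
    SEMANTIC = "semantic"
    SPATIAL = "spatial"
    TEMPORAL = "temporal"


class OutputFormat(Enum):
    # Output formats mentioned in the abstract (hypothetical identifiers).
    SEGMENTATION_MASK = "segmentation_mask"
    BOUNDING_BOX = "bounding_box"
    DESCRIPTION = "description"
    QUESTION_ANSWER = "question_answer"


@dataclass(frozen=True)
class RVTQuery:
    """One implicit text query about a video (hypothetical schema)."""
    video_id: str
    text: str
    category: ReasoningCategory
    output_format: OutputFormat


def group_by_format(queries):
    """Group queries by the output format they expect."""
    grouped = {}
    for query in queries:
        grouped.setdefault(query.output_format, []).append(query)
    return grouped


# Invented example queries, not taken from the benchmark.
EXAMPLE_QUERIES = [
    RVTQuery("vid_001", "Which object is the person most likely to pick up next?",
             ReasoningCategory.TEMPORAL, OutputFormat.BOUNDING_BOX),
    RVTQuery("vid_001", "Segment the item closest to the sink.",
             ReasoningCategory.SPATIAL, OutputFormat.SEGMENTATION_MASK),
    RVTQuery("vid_002", "What is the tool on the table used for?",
             ReasoningCategory.SEMANTIC, OutputFormat.QUESTION_ANSWER),
]

if __name__ == "__main__":
    for fmt, items in group_by_format(EXAMPLE_QUERIES).items():
        print(fmt.value, len(items))
```

Keeping the output format as a field of the query, rather than defining a separate record type per task, mirrors the idea that all RVTs share one formulation and differ only in the form of the expected answer.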
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Graph Neural Networks · Topic Modeling
