TwiFF (Think With Future Frames): A Large-Scale Dataset for Dynamic Visual Reasoning

Junhua Liu; Zhangcheng Wang; Zhike Han; Ningli Wang; Guotao Liang; Kun Kuang

arXiv:2602.10675·cs.CV·February 12, 2026

TL;DR

TwiFF introduces a large-scale, temporally grounded video reasoning dataset and a model that improves dynamic visual reasoning by combining future-frame prediction with question answering, outperforming existing VCoT methods on dynamic reasoning tasks.

Contribution

The paper presents TwiFF-2.7M, a large-scale dataset for dynamic visual reasoning, and TwiFF, a model that combines video generation and comprehension for improved temporal reasoning.

Findings

01

TwiFF outperforms existing VCoT methods on dynamic reasoning tasks.

02

The dataset enables better training and evaluation of temporal reasoning models.

03

The TwiFF model generates temporally coherent future frames to support its reasoning.

Abstract

Visual Chain-of-Thought (VCoT) has emerged as a promising paradigm for enhancing multimodal reasoning by integrating visual perception into intermediate reasoning steps. However, existing VCoT approaches are largely confined to static scenarios and struggle to capture the temporal dynamics essential for tasks such as instruction, prediction, and camera motion. To bridge this gap, we propose TwiFF-2.7M, the first large-scale, temporally grounded VCoT dataset derived from $2.7$ million video clips, explicitly designed for dynamic visual question answering. Accompanying this, we introduce TwiFF-Bench, a high-quality evaluation benchmark of $1{,}078$ samples that assesses both the plausibility of reasoning trajectories and the correctness of final answers in open-ended dynamic settings. Building on these foundations, we propose the TwiFF model, a unified model that synergistically leverages…

Code & Models

Models

🤗 Liu-Junhua/TwiFF-7B (model)

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Visual Attention and Saliency Detection