TwiFF (Think With Future Frames): A Large-Scale Dataset for Dynamic Visual Reasoning
Junhua Liu, Zhangcheng Wang, Zhike Han, Ningli Wang, Guotao Liang, Kun Kuang

TL;DR
TwiFF introduces a large-scale, temporally grounded video reasoning dataset and a novel model that enhances dynamic visual reasoning by integrating future frame prediction with question answering, significantly outperforming existing methods.
Contribution
The paper presents TwiFF-2.7M, a large-scale dataset for dynamic visual reasoning, and TwiFF, a model that combines video generation and comprehension for improved temporal reasoning.
Findings
TwiFF outperforms existing VCoT methods on dynamic reasoning tasks.
The dataset enables better training and evaluation of temporal reasoning models.
The TwiFF model generates temporally coherent future frames for reasoning.
Abstract
Visual Chain-of-Thought (VCoT) has emerged as a promising paradigm for enhancing multimodal reasoning by integrating visual perception into intermediate reasoning steps. However, existing VCoT approaches are largely confined to static scenarios and struggle to capture the temporal dynamics essential for tasks such as instruction, prediction, and camera motion. To bridge this gap, we propose TwiFF-2.7M, the first large-scale, temporally grounded VCoT dataset derived from million video clips, explicitly designed for dynamic visual question and answer. Accompanying this, we introduce TwiFF-Bench, a high-quality evaluation benchmark of samples that assesses both the plausibility of reasoning trajectories and the correctness of final answers in open-ended dynamic settings. Building on these foundations, we propose the TwiFF model, a unified model that synergistically leverages…
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Visual Attention and Saliency Detection
