CoF-T2I: Video Models as Pure Visual Reasoners for Text-to-Image Generation

Chengzhuo Tong; Mingkun Chang; Shenglong Zhang; Yuran Wang; Cheng Liang; Zhizheng Zhao; Ruichuan An; Bohan Zeng; Yang Shi; Yifan Dai; Ziming Zhao; Guanbin Li; Pengfei Wan; Yuanxing Zhang; Wentao Zhang

arXiv:2601.10061 · cs.CV · January 16, 2026

Open Access

TL;DR

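This paper introduces CoF-T2I, a model that applies the Chain-of-Frame (CoF) reasoning of video generation models to text-to-image generation. Intermediate frames serve as explicit reasoning steps, the final frame is taken as the output image, and a new dataset of CoF trajectories, CoF-Evol-Instruct, is curated to establish this process.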

Contribution

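The paper presents CoF-T2I, which integrates Chain-of-Frame (CoF) reasoning into text-to-image (T2I) generation through progressive visual refinement, and curates CoF-Evol-Instruct, a dataset of CoF trajectories. Because intermediate frames act as explicit reasoning steps, the generation process becomes more interpretable, and the paper reports improved output quality.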

Findings

01. Outperforms base video models in T2I tasks

02. Achieves 0.86 on GenEval benchmark

03. Reaches 7.468 on Imagine-Bench benchmark

Abstract

Recent video generation models have revealed the emergence of Chain-of-Frame (CoF) reasoning, enabling frame-by-frame visual inference. With this capability, video models have been successfully applied to various visual tasks (e.g., maze solving, visual puzzles). However, their potential to enhance text-to-image (T2I) generation remains largely unexplored due to the absence of a clearly defined visual reasoning starting point and interpretable intermediate states in the T2I generation process. To bridge this gap, we propose CoF-T2I, a model that integrates CoF reasoning into T2I generation via progressive visual refinement, where intermediate frames act as explicit reasoning steps and the final frame is taken as output. To establish such an explicit generation process, we curate CoF-Evol-Instruct, a dataset of CoF trajectories that model the generation process from semantics to…
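The output convention described in the abstract (intermediate frames as reasoning steps, final frame as the image) can be sketched as follows. This is a minimal illustration, not the authors' implementation: `toy_video_model`, `split_trajectory`, and `CoFResult` are hypothetical names, and the toy generator merely stands in for a real video model by producing frames that move steadily toward a target value.

```python
from dataclasses import dataclass
from typing import List, Sequence

# A toy grayscale image: rows of pixel intensities (assumption for illustration).
Frame = List[List[float]]


@dataclass
class CoFResult:
    reasoning_frames: List[Frame]  # intermediate frames, read as reasoning steps
    image: Frame                   # final frame, returned as the T2I output


def split_trajectory(frames: Sequence[Frame]) -> CoFResult:
    """Separate a generated frame sequence into reasoning steps and the output image."""
    if not frames:
        raise ValueError("a trajectory needs at least one frame")
    return CoFResult(reasoning_frames=list(frames[:-1]), image=frames[-1])


def toy_video_model(prompt: str, num_frames: int = 4) -> List[Frame]:
    """Hypothetical stand-in for a video generator: each frame is closer to a target."""
    target = float(len(prompt) % 10) / 10.0
    frames = []
    for t in range(1, num_frames + 1):
        value = target * t / num_frames  # progressive refinement toward the target
        frames.append([[value, value], [value, value]])
    return frames


frames = toy_video_model("a red cube on a blue sphere")
result = split_trajectory(frames)
print(len(result.reasoning_frames), "reasoning frames; final image:", result.image)
```

Keeping the intermediate frames alongside the final image, rather than discarding them, is what makes the trajectory inspectable after generation.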

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Artificial Intelligence in Games