Grid: Omni Visual Generation
Cong Wan, Xiangyang Luo, Hao Luo, Zijian Cai, Yiren Song, Yunlong Zhao, Yifan Bai, Fan Wang, Yuhang He, Yihong Gong

TL;DR
GRID reformulates temporal visual sequences as grid layouts so that existing image models can generate them, covering tasks from text-to-video to 3D editing with up to 67 times faster inference and less than 0.1% of the computational resources of specialized models.
Contribution
The paper presents GRID, a method that leverages existing image models for temporal sequence generation by reformulating sequences as grid layouts, achieving faster inference and broad applicability.
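The grid reformulation described above can be pictured with a minimal sketch. It is an illustration under stated assumptions, not the authors' implementation: the function names `frames_to_grid` and `grid_to_frames` are hypothetical, frames are plain nested lists of pixel values, and row-major frame ordering is assumed because the summary does not specify one.

```python
def frames_to_grid(frames, rows, cols):
    """Tile equally sized frames into one 2D image, row-major (assumed order)."""
    if len(frames) != rows * cols:
        raise ValueError("number of frames must equal rows * cols")
    grid = []
    for r in range(rows):
        # Frames that belong to this row of the grid.
        row_frames = frames[r * cols:(r + 1) * cols]
        height = len(row_frames[0])
        for y in range(height):
            # Concatenate scanline y of each frame, left to right.
            line = []
            for frame in row_frames:
                line.extend(frame[y])
            grid.append(line)
    return grid


def grid_to_frames(grid, rows, cols):
    """Inverse of frames_to_grid: cut the grid image back into a frame list."""
    h = len(grid) // rows      # height of a single frame
    w = len(grid[0]) // cols   # width of a single frame
    frames = []
    for r in range(rows):
        for c in range(cols):
            frame = [grid[r * h + y][c * w:(c + 1) * w] for y in range(h)]
            frames.append(frame)
    return frames
```

In this picture, a clip of `rows * cols` frames becomes a single image that an image model can process in one pass, and the generated grid is split back into frames afterwards.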
Findings
Achieves up to 67 times faster inference speeds.
Uses less than 0.1% of the computational resources of specialized models.
Performs well across tasks from Text-to-Video to 3D editing.
Abstract
Visual generation has witnessed remarkable progress in single-image tasks, yet extending these capabilities to temporal sequences remains challenging. Current approaches either build specialized video models from scratch with enormous computational costs or add separate motion modules to image generators, both requiring learning temporal dynamics anew. We observe that modern image generation models possess underutilized potential in handling structured layouts with implicit temporal understanding. Building on this insight, we introduce GRID, which reformulates temporal sequences as grid layouts, enabling holistic processing of visual sequences while leveraging existing model capabilities. Through a parallel flow-matching training strategy with coarse-to-fine scheduling, our approach achieves up to 67× faster inference speeds while using <1/1000 of the computational resources compared to…
Taxonomy
Topics: 3D Surveying and Cultural Heritage · Advanced Vision and Imaging · Robotics and Sensor-Based Localization
