VistaGEN: Consistent Driving Video Generation with Fine-Grained Control Using Multiview Visual-Language Reasoning

Li-Heng Chen; Ke Cheng; Yahui Liu; Lei Shi; Shi-Sheng Huang; Hongbo Fu

arXiv:2603.28353 · cs.CV · March 31, 2026


TL;DR

VistaGEN is a novel driving video generation method that offers fine-grained object control and maintains spatiotemporal consistency in long videos through multiview visual-language reasoning and a closed-loop refinement process.

Contribution

VistaGEN introduces a multiview visual-language reasoning framework and a closed-loop generation-evaluation-regeneration mechanism for producing high-quality, controllable, and consistent long driving videos.
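The generation-evaluation-regeneration idea can be illustrated with a minimal sketch. This is not the authors' implementation: the function `closed_loop_generate`, the `threshold` and `max_attempts` parameters, and the toy generator and evaluator are all hypothetical stand-ins. In the paper, the evaluator role is played by a multiview vision-language model, whose interface is not described here.

```python
from typing import Callable, List, Tuple

Clip = List[float]  # hypothetical stand-in for a generated video clip


def closed_loop_generate(
    generate: Callable[[int], Clip],
    evaluate: Callable[[Clip], float],
    threshold: float,
    max_attempts: int = 3,
) -> Tuple[Clip, float, int]:
    """Generate a clip, score it, and regenerate until the score passes
    the threshold or the attempt budget runs out. Returns the best clip
    seen, its score, and the number of attempts used."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    best_clip: Clip = []
    best_score = float("-inf")
    for attempt in range(max_attempts):
        clip = generate(attempt)      # generation step
        score = evaluate(clip)        # evaluation step
        if score > best_score:
            best_clip, best_score = clip, score
        if score >= threshold:        # good enough: stop regenerating
            return best_clip, best_score, attempt + 1
    return best_clip, best_score, max_attempts


# Toy stand-ins (assumptions for illustration only): each "frame" is a
# quality value, and the evaluator reports the worst frame in the clip.
def toy_generate(attempt: int) -> Clip:
    return [0.4 + 0.2 * attempt] * 4


def toy_evaluate(clip: Clip) -> float:
    return min(clip)


clip, score, attempts = closed_loop_generate(
    toy_generate, toy_evaluate, threshold=0.7
)
```

Keeping the best-scoring clip means the loop still returns a usable result when no attempt reaches the threshold.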

Findings

01. Achieves diverse, fine-grained control over driving videos.

02. Maintains superior spatiotemporal consistency in long videos.

03. Effectively handles long-tail objects in generated videos.

Abstract

Driving video generation has made considerable progress in controllability, video resolution, and length, but existing methods still fail to support fine-grained, object-level control over diverse driving videos while preserving spatiotemporal consistency, especially in long video generation. In this paper, we present a new driving video generation technique, called VistaGEN, which enables fine-grained control of specific entities, including 3D objects, images, and text descriptions, while maintaining spatiotemporal consistency in long video sequences. Our key innovation is the incorporation of multiview visual-language reasoning into long driving video generation. To this end, we inject visual-language features into a multiview video generator to enable fine-grained controllability. More importantly, we propose a multiview vision-language evaluator (MV-VLM) to intelligently and automatically…
