ARCON: Advancing Auto-Regressive Continuation for Driving Videos
Ruibo Ming, Jingwei Wu, Zhewei Huang, Zhuoxuan Ju, Jianming Hu, Lihui Peng, Shuchang Zhou

TL;DR
ARCON is an approach to driving video continuation in which a large vision model alternates between generating semantic tokens and RGB tokens, yielding consistent RGB frames and semantic maps and enabling long video generation in autonomous driving scenarios.
Contribution
The paper presents ARCON, a new scheme that improves video continuation by explicitly learning high-level structure through token alternation and optical flow-based enhancement.
Findings
High consistency in generated RGB images and semantic maps.
Effective long video generation in autonomous driving scenarios.
Enhanced visual quality through optical flow-based stitching.
Abstract
Recent advancements in auto-regressive large language models (LLMs) have led to their application in video generation. This paper explores the use of Large Vision Models (LVMs) for video continuation, a task essential for building world models and predicting future frames. We introduce ARCON, a scheme that alternates between generating semantic and RGB tokens, allowing the LVM to explicitly learn high-level structural video information. We find high consistency in the RGB images and semantic maps generated without special design. Moreover, we employ an optical flow-based texture stitching method to enhance visual quality. Experiments in autonomous driving scenarios show that our model can consistently generate long videos.
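The alternation described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the semantic-before-RGB ordering within each frame, the equal token count per map, and the `next_token` callback standing in for the LVM are all assumptions made for illustration.

```python
from typing import Callable, List, Sequence, Tuple

Frame = List[int]  # token ids for one frame (hypothetical tokenizer output)


def interleave(semantic: Sequence[Frame], rgb: Sequence[Frame]) -> List[int]:
    """Flatten per-frame tokens into one sequence: [S_1, R_1, S_2, R_2, ...].

    Assumption: the semantic tokens of a frame precede its RGB tokens.
    """
    if len(semantic) != len(rgb):
        raise ValueError("need one semantic map per RGB frame")
    seq: List[int] = []
    for s, r in zip(semantic, rgb):
        seq.extend(s)
        seq.extend(r)
    return seq


def deinterleave(seq: Sequence[int], n: int) -> Tuple[List[Frame], List[Frame]]:
    """Split a flat sequence back into semantic and RGB frames.

    Assumption: every semantic map and every RGB frame uses exactly n tokens.
    """
    if len(seq) % (2 * n) != 0:
        raise ValueError("sequence length must be a multiple of 2 * n")
    semantic: List[Frame] = []
    rgb: List[Frame] = []
    for start in range(0, len(seq), 2 * n):
        semantic.append(list(seq[start:start + n]))
        rgb.append(list(seq[start + n:start + 2 * n]))
    return semantic, rgb


def continue_video(
    context: Sequence[int],
    next_token: Callable[[Sequence[int]], int],
    new_frames: int,
    n: int,
) -> List[int]:
    """Auto-regressively append new_frames frames (semantic + RGB tokens each).

    next_token is a placeholder for the model: it maps the sequence so far
    to the next token id.
    """
    seq = list(context)
    for _ in range(new_frames * 2 * n):
        seq.append(next_token(seq))
    return seq
```

Because the two token types share a single sequence, every generated RGB token is conditioned on the semantic tokens of the same frame and on all earlier frames, which is one plausible reading of how the model picks up high-level structure.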
Taxonomy
Topics: Image Enhancement Techniques · Advanced Vision and Imaging · Advanced Image Processing Techniques
