TL;DR
RealCam introduces an autoregressive framework for real-time, camera-controlled video synthesis from monocular footage, overcoming latency and scalability issues of previous methods.
Contribution
RealCam presents a cross-frame in-context learning paradigm, a distillation approach, and a loop consistency augmentation that together enable interactive, real-time video generation.
Findings
Achieves state-of-the-art visual fidelity and temporal consistency.
Enables truly interactive camera control with faster inference.
Breaks the rigid prefix bottleneck of prior methods.
Abstract
Camera-controlled video-to-video (V2V) generation enables dynamic viewpoint synthesis from monocular footage, holding immense potential for interactive filmmaking and live broadcasting. However, existing implicit synthesis methods fundamentally rely on non-causal, full-sequence processing and rigid prefix-style temporal concatenation. This architectural paradigm mandates bidirectional attention, resulting in prohibitive computational latency, quadratic complexity scaling, and inherent incompatibility with real-time streaming or variable-length inputs. To overcome these limitations, we introduce RealCam, a novel autoregressive framework for interactive, real-time camera-controlled V2V generation. We first design a high-fidelity teacher model grounded in a Cross-frame In-context Learning paradigm. By interleaving source and target frames into synchronized contextual…
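The contrast the abstract draws between prefix-style concatenation and interleaving can be illustrated with a toy sketch. This is not RealCam's implementation: the abstract is truncated and does not specify the actual token layout, so the function names (`prefix_order`, `interleaved_order`, `source_frames_needed`) and the one-source-frame-per-target-frame pairing are assumptions made purely for illustration. The sketch counts how many source frames must already be available before each target frame can be produced under a causal (left-to-right) ordering.

```python
# Toy illustration only: frame layout and names are hypothetical,
# not taken from the RealCam paper.

def prefix_order(n):
    """Prefix-style layout: all source frames first, then all target frames."""
    return [("src", i) for i in range(n)] + [("tgt", i) for i in range(n)]

def interleaved_order(n):
    """Interleaved layout: s0, t0, s1, t1, ..."""
    order = []
    for i in range(n):
        order.append(("src", i))
        order.append(("tgt", i))
    return order

def source_frames_needed(order, target_index):
    """Count the source frames that precede a given target frame,
    i.e. the ones that must exist before it under a causal ordering."""
    pos = order.index(("tgt", target_index))
    return sum(1 for kind, _ in order[:pos] if kind == "src")

n = 4
prefix_needed = [source_frames_needed(prefix_order(n), i) for i in range(n)]
interleaved_needed = [source_frames_needed(interleaved_order(n), i) for i in range(n)]

print(prefix_needed)       # [4, 4, 4, 4]
print(interleaved_needed)  # [1, 2, 3, 4]
```

In this toy setting the prefix layout needs the entire source clip before the first target frame, whereas the interleaved layout needs only the source frames seen so far, which is the property that makes streaming and variable-length inputs possible in principle.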
