MotionStream: Real-Time Video Generation with Interactive Motion Controls
Joonghyuk Shin, Zhengqi Li, Richard Zhang, Jun-Yan Zhu, Jaesik Park, Eli Shechtman, Xun Huang

TL;DR
MotionStream introduces a real-time, interactive video generation system that produces high-quality, infinite-length videos with sub-second latency by distilling a non-causal model into a causal, streaming-capable model.
Contribution
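The paper presents a method for real-time, infinite-length video generation with motion control. It distills a bidirectional, motion-controlled teacher into a causal student via Self Forcing with Distribution Matching Distillation, and relies on attention sinks and a KV cache to keep long-horizon generation efficient and stable.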
Findings
Achieves up to 29 FPS streaming generation on a single GPU.
Enables infinite-length video generation with constant computational cost.
Outperforms previous methods in motion fidelity and video quality.
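A constant cost per generated chunk over an unbounded horizon implies that the attention context cannot grow with video length. The reviews mention attention sinks and a KV cache; one common way to combine them is to keep the earliest entries permanently and hold the rest in a fixed-size sliding window. The following is a minimal sketch of that bookkeeping only, not the authors' implementation: the class name `SinkWindowCache` and the parameters `num_sink` and `window` are hypothetical, and integer chunk IDs stand in for real key/value tensors.

```python
from collections import deque


class SinkWindowCache:
    """Bounded cache: the first `num_sink` entries are kept forever (the sinks),
    later entries live in a sliding window of the `window` most recent ones.

    Hypothetical illustration; entries stand in for per-chunk key/value tensors.
    """

    def __init__(self, num_sink: int, window: int):
        self.num_sink = num_sink
        self.sink = []                      # earliest entries, never evicted
        self.recent = deque(maxlen=window)  # oldest entry is dropped automatically

    def append(self, entry):
        if len(self.sink) < self.num_sink:
            self.sink.append(entry)
        else:
            self.recent.append(entry)

    def context(self):
        """Entries the next chunk would attend to."""
        return self.sink + list(self.recent)


cache = SinkWindowCache(num_sink=1, window=3)
sizes = []
for chunk_id in range(10):
    cache.append(chunk_id)
    sizes.append(len(cache.context()))

print(cache.context())  # sink entry followed by the three most recent chunks
print(max(sizes))       # context size never exceeds num_sink + window
```

Because the context is capped at `num_sink + window` entries, the attention work per new chunk stays bounded no matter how many chunks have been generated.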
Abstract
Current motion-conditioned video generation methods suffer from prohibitive latency (minutes per video) and non-causal processing that prevents real-time interaction. We present MotionStream, enabling sub-second latency with up to 29 FPS streaming generation on a single GPU. Our approach begins by augmenting a text-to-video model with motion control, which generates high-quality videos that adhere to the global text prompt and local motion guidance, but does not perform inference on the fly. As such, we distill this bidirectional teacher into a causal student through Self Forcing with Distribution Matching Distillation, enabling real-time streaming inference. Several key challenges arise when generating videos of long, potentially infinite time-horizons -- (1) bridging the domain gap from training on finite length and extrapolating to infinite horizons, (2) sustaining high quality by…
Peer Reviews
Decision: ICLR 2026 Oral
1. Complete "Teacher-Student" Framework: The entire pipeline, from designing an efficient motion-controlled teacher model to distilling a causal student model, is rigorously and systematically designed. 2. Lightweight Track Encoder: The use of sinusoidal positional encoding with a learnable track head, compared to the VAE-based RGB encoding method, achieves a 40× speedup in encoding while maintaining high track adherence accuracy, which is crucial for real-time systems. 3. Attention Sinking with
1. It is worth noting that utilizing distillation to enhance model inference efficiency is an established technique in model optimization, including the video generation domain. Several contemporaneous works, such as Hunyuan-Gamecraft and notably Matrix-Game 2.0, have adopted a highly similar paradigm—combining a distillation framework with an autoregressive student model and Self-Forcing training. While the integration presented here is non-trivial, the core methodological concept bears a stron
1. Addresses a timely and important problem: bringing real-time interaction to diffusion-based video models. 2. The proposed attention sink and KV cache mechanisms are intuitively sound and practically effective. 3. Extensive experiments and ablations support the framework's stability and efficiency. 4. Overall writing and presentation are polished and easy to follow.
1. The evaluation datasets appear limited to short human-action or camera-move clips; it's unclear how the model performs on scenes with large physical transformations (e.g., object deformation, rapid perspective change). 2. Limited theoretical analysis; most findings are empirical.
1. The manuscript’s motivation is clear and compelling, and the proposed solution appears technically sound and promising. 2. The authors provide thorough comparative evaluations and ablation studies that isolate the contributions of each component. 3. The presented demos are persuasive, showcasing the method’s practical impact.
1. The manuscript now reads like a composition of prior work (self-forcing, attention sinks, and motion-conditioned video generation) assembled to achieve real-time, controllable video generation. However, specific contributions are not clearly presented. 2. For motion conditioning, the method employs channel-wise concatenated embeddings and fine-tunes pretrained image-to-video models. While computationally efficient, this design may degrade the pretrained model’s general generative capabilities.
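One of the reviews above credits a track encoder built on sinusoidal positional encoding with a learnable track head, reporting a 40× encoding speedup over VAE-based RGB encoding. A sinusoidal encoding is a closed-form function, which is one plausible reason it is cheap. The sketch below shows a generic sinusoidal embedding of a scalar under stated assumptions: the function name `sinusoidal_embedding`, the `dim` and `max_period` parameters, and the choice of what scalar to encode are all hypothetical, since this summary does not specify them, and the learnable head is omitted.

```python
import math


def sinusoidal_embedding(value: float, dim: int, max_period: float = 10000.0):
    """Map a scalar to `dim` features: sines then cosines at geometrically
    spaced frequencies. Generic illustration, not the paper's encoder."""
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    half = dim // 2
    # Frequencies decay from 1 toward 1 / max_period.
    freqs = [max_period ** (-i / half) for i in range(half)]
    sines = [math.sin(value * f) for f in freqs]
    cosines = [math.cos(value * f) for f in freqs]
    return sines + cosines


emb = sinusoidal_embedding(3.0, dim=8)
print(len(emb))  # one feature per requested dimension
```

In a full encoder, a small learned layer would typically map these fixed features into the model's conditioning space; that part is left out here.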
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Human Motion and Animation · Human Pose and Action Recognition
