EchoTorrent: Towards Swift, Sustained, and Streaming Multi-Modal Video Generation
Rang Meng, Weipeng Wu, Yuming Li, Chenguang Ma

TL;DR
EchoTorrent targets real-time, streaming multi-modal video generation, combining training-time and inference-time techniques to improve temporal stability, visual fidelity, and audio-lip synchronization.
Contribution
The paper presents a four-part schema that combines multi-teacher training, adaptive CFG calibration, hybrid long tail forcing, and VAE decoder refinement for streaming video generation.
Findings
Achieves few-pass autoregressive generation with extended temporal consistency.
Improves identity preservation and audio-lip synchronization.
Reduces latency and mitigates multimodal degradation in streaming mode.
Abstract
Recent multi-modal video generation models have achieved high visual quality, but their prohibitive latency and limited temporal stability hinder real-time deployment. Streaming inference exacerbates these issues, leading to pronounced multimodal degradation, such as spatial blurring, temporal drift, and lip desynchronization, which creates an unresolved efficiency-performance trade-off. To this end, we propose EchoTorrent, a novel schema with a fourfold design: (1) Multi-Teacher Training fine-tunes a pre-trained model on distinct preference domains to obtain specialized domain experts, which sequentially transfer domain-specific knowledge to a student model; (2) Adaptive CFG Calibration (ACC-DMD) calibrates the audio CFG augmentation errors in DMD via a phased spatiotemporal schedule, eliminating redundant CFG computations and enabling single-pass inference per step; (3) Hybrid…
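To make the "redundant CFG computations" in item (2) concrete, the sketch below shows why standard classifier-free guidance (CFG) costs two forward passes per denoising step, while a student that has absorbed the guidance needs only one. This is a generic illustration of CFG, not the paper's ACC-DMD procedure, whose phased spatiotemporal schedule is not described in this summary. The names `CountingDenoiser`, `cfg_step`, and `single_pass_step` are hypothetical, and the toy denoiser is a stand-in for a real network.

```python
from typing import List, Optional

Vector = List[float]


class CountingDenoiser:
    """Toy stand-in for a denoising network (hypothetical); counts forward passes."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, x: Vector, audio: Optional[float]) -> Vector:
        self.calls += 1
        # With audio=None the model runs unconditionally (no audio shift).
        shift = 0.0 if audio is None else audio
        return [0.5 * v + shift for v in x]


def cfg_step(model: CountingDenoiser, x: Vector, audio: float, scale: float) -> Vector:
    """Standard CFG: one unconditional and one conditional pass, then extrapolate."""
    uncond = model(x, None)
    cond = model(x, audio)
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]


def single_pass_step(model: CountingDenoiser, x: Vector, audio: float) -> Vector:
    """A guidance-distilled student (assumed) needs only the conditional pass."""
    return model(x, audio)


if __name__ == "__main__":
    teacher = CountingDenoiser()
    cfg_step(teacher, [1.0, 2.0], audio=0.25, scale=3.0)
    student = CountingDenoiser()
    single_pass_step(student, [1.0, 2.0], audio=0.25)
    print("forward passes per step:", teacher.calls, "vs", student.calls)
```

In a streaming setting this per-step cost is paid for every generated chunk, which is why removing the extra pass matters for latency.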
