EchoTorrent: Towards Swift, Sustained, and Streaming Multi-Modal Video Generation

Rang Meng; Weipeng Wu; Yuming Li; Chenguang Ma

arXiv:2602.13669·cs.CV·April 23, 2026

TL;DR

EchoTorrent targets real-time, streaming multi-modal video generation, combining training and inference techniques to improve temporal stability, visual fidelity, and audio-lip synchronization.

Contribution

The paper presents a novel schema for streaming video generation that combines multi-teacher training, adaptive CFG calibration, hybrid long tail forcing, and VAE decoder refinement.

Findings

01. Achieves few-pass autoregressive generation with extended temporal consistency.

02. Improves identity preservation and audio-lip synchronization.

03. Reduces latency and mitigates multimodal degradation in streaming mode.

Abstract

Recent multi-modal video generation models have achieved high visual quality, but their prohibitive latency and limited temporal stability hinder real-time deployment. Streaming inference exacerbates these issues, leading to pronounced multimodal degradation, such as spatial blurring, temporal drift, and lip desynchronization, and leaving an unresolved efficiency-performance trade-off. To this end, we propose EchoTorrent, a novel schema with a fourfold design: (1) Multi-Teacher Training fine-tunes a pre-trained model on distinct preference domains to obtain specialized domain experts, which sequentially transfer domain-specific knowledge to a student model; (2) Adaptive CFG Calibration (ACC-DMD) calibrates the audio CFG augmentation errors in DMD via a phased spatiotemporal schedule, eliminating redundant CFG computations and enabling single-pass inference per step; (3) Hybrid…
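To make the cost that item (2) refers to concrete, the sketch below contrasts standard classifier-free guidance (CFG), which needs one unconditional and one conditional forward pass per denoising step, with a single conditional pass. It is a toy illustration only and is not the paper's ACC-DMD procedure: `CountingDenoiser`, `cfg_predict`, `single_pass_predict`, and the arithmetic inside the denoiser are all hypothetical names and behavior invented for this example.

```python
from typing import List, Optional

Vector = List[float]


class CountingDenoiser:
    """Hypothetical stand-in for a denoising network; counts forward passes."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, x: Vector, audio: Optional[Vector]) -> Vector:
        self.calls += 1
        # Toy behavior (an assumption, not a real model): halve the input,
        # and add the audio signal when conditioning is provided.
        if audio is None:
            return [0.5 * xi for xi in x]
        return [0.5 * xi + ai for xi, ai in zip(x, audio)]


def cfg_predict(model: CountingDenoiser, x: Vector, audio: Vector,
                scale: float) -> Vector:
    """Standard CFG: uncond + scale * (cond - uncond), two forward passes."""
    uncond = model(x, None)
    cond = model(x, audio)
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]


def single_pass_predict(model: CountingDenoiser, x: Vector,
                        audio: Vector) -> Vector:
    """One conditional forward pass, as a guidance-free model would use."""
    return model(x, audio)


guided = CountingDenoiser()
guided_out = cfg_predict(guided, [2.0, 4.0], [1.0, -1.0], scale=3.0)

direct = CountingDenoiser()
direct_out = single_pass_predict(direct, [2.0, 4.0], [1.0, -1.0])

print(guided.calls, direct.calls)  # forward passes per step: 2 vs 1
```

In this toy setup the guided path doubles the per-step network cost. A model trained so that its conditional output already accounts for the guidance effect could skip the unconditional pass, which is the kind of saving the abstract describes as single-pass inference per step.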
