Apollo: Unified Multi-Task Audio-Video Joint Generation
Jun Wang, Chunyu Qiang, Yuxin Guo, Yiran Wang, Xijuan Zeng, Feng Deng

TL;DR
Apollo introduces a unified multi-task framework for high-quality, temporally aligned audio-video generation, leveraging novel architecture, training strategies, and a large-scale dataset to outperform prior methods and enhance generalization.
Contribution
The paper presents a novel unified model architecture, a progressive training regime, and a large-scale dataset for improved audio-video joint generation, addressing key challenges in synchronization and generalization.
Findings
Achieves tight audio-visual alignment and scalability.
Outperforms prior methods significantly across tasks.
Generalizes robustly to out-of-distribution data.
Abstract
Audio-video joint generation has progressed rapidly, yet substantial challenges remain. Non-commercial approaches still suffer from audio-visual asynchrony, poor lip-speech alignment, and unimodal degradation, which stem from weak audio-visual correspondence modeling, limited generalization, and scarce high-quality dense-caption data. To address these issues, we introduce Apollo and explore three axes: model architecture, training strategy, and data curation. Architecturally, we adopt a single-tower design with unified DiT blocks and an Omni-Full Attention mechanism, achieving tight audio-visual alignment and strong scalability. Training-wise, we adopt a progressive multitask regime (random modality masking to joint optimization across tasks) and a multistage curriculum, yielding robust representations, strengthening A-V aligned world knowledge, and preventing unimodal…
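The abstract names random modality masking as part of the multitask regime but does not describe how it works. The sketch below is one plausible reading, not the authors' implementation: for each training sample a task is drawn at random, and the modality that serves as conditioning is excluded from the loss. The task names, the video-then-audio token ordering, and the function `sample_modality_mask` are all assumptions made for illustration.

```python
import random

# Hypothetical task set; the abstract does not list the actual tasks.
TASKS = ("joint", "video_to_audio", "audio_to_video")


def sample_modality_mask(n_video, n_audio, rng):
    """Draw a task at random and return (task, loss_mask).

    loss_mask[i] is True when token i is a generation target (it would be
    noised and counted in the loss) and False when it is kept clean as
    conditioning. Tokens are assumed to be ordered video first, then audio,
    in one shared sequence.
    """
    task = rng.choice(TASKS)
    video_is_target = task in ("joint", "audio_to_video")
    audio_is_target = task in ("joint", "video_to_audio")
    mask = [video_is_target] * n_video + [audio_is_target] * n_audio
    return task, mask


if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(3):
        task, mask = sample_modality_mask(n_video=4, n_audio=2, rng=rng)
        print(task, mask)
```

Under this reading, a single-tower model would pass both token segments through the same blocks, and the mask alone would determine which task a given sample trains.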
Taxonomy
Topics: Speech and Audio Processing · Generative Adversarial Networks and Image Synthesis · Music Technology and Sound Studies
