Apollo: Unified Multi-Task Audio-Video Joint Generation

Jun Wang; Chunyu Qiang; Yuxin Guo; Yiran Wang; Xijuan Zeng; Feng Deng

arXiv:2601.04151·cs.CV·January 14, 2026

TL;DR

Apollo is a unified multi-task framework for temporally aligned audio-video joint generation. It combines a single-tower DiT architecture with Omni-Full Attention, a progressive multitask training regime, and a large-scale dataset, and is reported to outperform prior methods and to generalize better.

Contribution

The paper contributes a unified model architecture, a progressive training regime, and a large-scale dataset for audio-video joint generation, targeting audio-visual synchronization and generalization.

Findings

01. Achieves tight audio-visual alignment and scalability.

02. Outperforms prior methods significantly across tasks.

03. Generalizes robustly to out-of-distribution data.

Abstract

Audio-video joint generation has progressed rapidly, yet substantial challenges remain. Non-commercial approaches still suffer from audio-visual asynchrony, poor lip-speech alignment, and unimodal degradation, which stem from weak audio-visual correspondence modeling, limited generalization, and scarce high-quality dense-caption data. To address these issues, we introduce Apollo and investigate three axes: model architecture, training strategy, and data curation. Architecturally, we adopt a single-tower design with unified DiT blocks and an Omni-Full Attention mechanism, achieving tight audio-visual alignment and strong scalability. For training, we adopt a progressive multitask regime (random modality masking to joint optimization across tasks) and a multistage curriculum, yielding robust representations, strengthening A-V aligned world knowledge, and preventing unimodal…
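The single-tower idea in the abstract can be illustrated with a small, dependency-free sketch. The abstract does not specify how Omni-Full Attention or random modality masking are implemented, so everything below is an assumption made for illustration: `joint_attention` simply concatenates video and audio tokens into one sequence and applies plain single-head softmax attention with identity projections, and `sample_modality_mask` is a hypothetical task sampler that decides which modality is generated and which serves as conditioning. None of these names come from the paper.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def full_attention(tokens):
    """Single-head self-attention with identity Q/K/V projections.

    Illustrative only: a real DiT block would use learned projections,
    multiple heads, normalization, and timestep conditioning.
    """
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    outputs, weights = [], []
    for query in tokens:
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in tokens]
        w = softmax(scores)
        weights.append(w)
        outputs.append(
            [sum(w[j] * tokens[j][d] for j in range(len(tokens))) for d in range(dim)]
        )
    return outputs, weights


def joint_attention(video_tokens, audio_tokens):
    """Assumed single-tower layout: one shared sequence, no cross-modal mask.

    Every video token can attend to every audio token and vice versa.
    """
    sequence = video_tokens + audio_tokens
    outputs, weights = full_attention(sequence)
    n_video = len(video_tokens)
    return outputs[:n_video], outputs[n_video:], weights


def sample_modality_mask(rng):
    """Hypothetical task sampler for multitask training (an assumption).

    Picks joint generation or one of the two conditional directions.
    """
    task = rng.choice(["joint", "video_to_audio", "audio_to_video"])
    return {
        "task": task,
        "generate_video": task in ("joint", "audio_to_video"),
        "generate_audio": task in ("joint", "video_to_audio"),
    }


# Usage: two toy video tokens and one toy audio token, each of dimension 2.
video = [[1.0, 0.0], [0.0, 1.0]]
audio = [[1.0, 1.0]]
video_out, audio_out, attn = joint_attention(video, audio)
mask = sample_modality_mask(random.Random(0))
```

In this sketch the attention matrix `attn` is 3 x 3, and its cross-modal entries are nonzero, which is the property a shared sequence is meant to provide. A two-tower design would instead need an explicit cross-attention step to exchange information between modalities.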

Taxonomy

Topics: Speech and Audio Processing · Generative Adversarial Networks and Image Synthesis · Music Technology and Sound Studies