SINGER: Vivid Audio-driven Singing Video Generation with Multi-scale Spectral Diffusion Model
Yan Li, Ziya Zhou, Zhiqiang Wang, Wei Xue, Wenhan Luo, Yike Guo

TL;DR
SINGER is a novel diffusion-based model that generates vivid singing videos by learning spectral patterns and human behaviors from a new in-the-wild singing dataset, outperforming existing methods.
Contribution
The paper introduces multi-scale spectral and spectral-filtering modules into a diffusion model for singing video generation and provides a new high-quality singing dataset.
Findings
SINGER outperforms state-of-the-art methods in objective metrics.
SINGER produces more vivid and realistic singing videos.
The spectral modules effectively capture singing-specific audio patterns.
Abstract
Recent advances in generative models have significantly improved talking face video generation, yet singing video generation remains underexplored. Fundamental differences between talking and singing, specifically in audio characteristics and behavioral expressions, limit the performance of existing talking face video generation models when applied to singing. We observe that the differences between singing and talking audio manifest in frequency and amplitude. To address this, we design a multi-scale spectral module to help the model learn singing patterns in the spectral domain. Additionally, we develop a spectral-filtering module that aids the model in learning the human behaviors associated with singing audio. These two modules are integrated into the diffusion model…
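To make the phrase "multi-scale" in the spectral domain concrete, here is a minimal sketch of analyzing one audio signal at several time-frequency resolutions. It is not the paper's module: the function names (`stft_magnitude`, `multi_scale_spectra`), the window sizes, the hop ratio, and the synthetic vibrato tone are all assumptions chosen for illustration, and a plain short-time Fourier transform stands in for whatever learned transform the authors use.

```python
import numpy as np

def stft_magnitude(signal, win_size, hop):
    """Magnitude spectrogram from a Hann-windowed short-time Fourier transform."""
    window = np.hanning(win_size)
    n_frames = 1 + (len(signal) - win_size) // hop
    frames = np.stack([
        signal[i * hop : i * hop + win_size] * window
        for i in range(n_frames)
    ])
    # One row per frame, one column per non-negative frequency bin.
    return np.abs(np.fft.rfft(frames, axis=1))

def multi_scale_spectra(signal, win_sizes=(256, 512, 1024)):
    """Hypothetical helper: one spectrogram per window size.

    Short windows resolve fast amplitude changes; long windows resolve
    fine frequency detail such as sustained pitch.
    """
    return {w: stft_magnitude(signal, w, hop=w // 4) for w in win_sizes}

# Synthetic stand-in for a sung note: a 220 Hz tone with 6 Hz vibrato (assumed values).
sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 220 * t + 5 * np.sin(2 * np.pi * 6 * t))

spectra = multi_scale_spectra(tone)
for win_size, spec in spectra.items():
    print(win_size, spec.shape)
```

Each window size trades time resolution against frequency resolution, which is one plausible reason to combine scales when the signal of interest varies in both frequency and amplitude.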
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Music Technology and Sound Studies
Methods: Diffusion
