FastSAG: Towards Fast Non-Autoregressive Singing Accompaniment Generation

Jianyi Chen; Wei Xue; Xu Tan; Zhen Ye; Qifeng Liu; Yike Guo

arXiv:2405.07682·cs.SD·May 14, 2024

TL;DR

FastSAG introduces a non-autoregressive, diffusion-based framework for singing accompaniment generation (SAG). It produces high-quality accompaniments while generating at least 30 times faster than previous autoregressive models, which makes real-time applications feasible.

Contribution

This paper presents a novel non-autoregressive diffusion model for singing accompaniment generation (SAG) that simplifies the existing autoregressive, token-based framework and significantly accelerates generation while maintaining high quality.

Findings

01 Generates better samples than SingSong.

02 Generates at least 30 times faster than previous autoregressive models.

03 Ensures semantic and rhythmic coherence with the vocal signals.

Abstract

Singing Accompaniment Generation (SAG), which generates instrumental music to accompany input vocals, is crucial to developing human-AI symbiotic art creation systems. The state-of-the-art method, SingSong, uses a multi-stage autoregressive (AR) model for SAG. However, this method is extremely slow because it generates semantic and acoustic tokens recursively, which makes real-time applications impossible. In this paper, we aim to develop a Fast SAG method that can create high-quality and coherent accompaniments. We develop a non-AR diffusion-based framework that, by carefully designing the conditions inferred from the vocal signals, directly generates the Mel spectrogram of the target accompaniment. With diffusion and Mel spectrogram modeling, the proposed method significantly simplifies the AR token-based SingSong framework and greatly accelerates generation. We also…
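The speed argument in the abstract can be illustrated with a toy sketch. It is not the FastSAG or SingSong implementation: the function names, the toy "models", the frame count, and the number of sampling steps are all assumptions chosen for illustration. It only shows the structural difference the abstract describes: an AR generator makes one sequential model call per output token, while a non-AR diffusion sampler refines the whole Mel spectrogram for a fixed number of steps that does not grow with the output length.

```python
import random


def ar_generate(num_tokens, next_token):
    """Autoregressive loop: one sequential model call per output token."""
    tokens, calls = [], 0
    for _ in range(num_tokens):
        tokens.append(next_token(tokens))  # each call depends on the prefix
        calls += 1
    return tokens, calls


def non_ar_generate(condition, num_steps, denoise, seed=0):
    """Non-AR diffusion-style loop: every call updates all frames at once."""
    rng = random.Random(seed)
    # Start from Gaussian noise with the same shape as the conditioning input.
    x = [[rng.gauss(0.0, 1.0) for _ in row] for row in condition]
    calls = 0
    for step in range(num_steps, 0, -1):
        x = denoise(x, condition, step)
        calls += 1
    return x, calls


def toy_next_token(prefix):
    # Hypothetical stand-in for an AR token model.
    return len(prefix) % 4


def toy_denoise(x, condition, step):
    # Hypothetical stand-in for a denoiser: move each bin toward the condition.
    return [
        [xi + (ci - xi) / step for xi, ci in zip(x_row, c_row)]
        for x_row, c_row in zip(x, condition)
    ]


frames, mel_bins = 200, 8  # assumed sizes, not taken from the paper
condition = [[0.0] * mel_bins for _ in range(frames)]

_, ar_calls = ar_generate(frames, toy_next_token)
mel, diffusion_calls = non_ar_generate(condition, num_steps=10, denoise=toy_denoise)
print(ar_calls, diffusion_calls)
```

In this sketch, doubling `frames` doubles the number of sequential AR calls but leaves the number of diffusion calls unchanged. The reported speedup of at least 30 times is the paper's measured result and is not derived from these toy counts.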

Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing · Speech Recognition and Synthesis

Methods: Sparse Evolutionary Training · Self-Attention Guidance · Diffusion