SRC-gAudio: Sampling-Rate-Controlled Audio Generation
Chenxing Li, Manjie Xu, Dong Yu

TL;DR
SRC-gAudio is a unified diffusion-based model that generates audio from text at multiple sampling rates and uses low-sampling-rate data to improve high-sampling-rate synthesis.
Contribution
The paper introduces a sampling-rate-controlled audio generation model that supports multiple sampling rates within a single architecture, and it explores the benefits of pre-training on low-sampling-rate data.
Findings
Effective multi-rate audio generation demonstrated
Pre-training on low-sampling-rate data improves quality
Model outperforms existing methods in controlled sampling rate generation
Abstract
We introduce SRC-gAudio, a novel audio generation model designed to facilitate text-to-audio generation across a wide range of sampling rates within a single model architecture. SRC-gAudio incorporates the sampling rate as part of the generation condition to guide the diffusion-based audio generation process. Our model enables the generation of audio at multiple sampling rates with a single unified model. Furthermore, we explore the potential benefits of large-scale, low-sampling-rate data in enhancing the generation quality of high-sampling-rate audio. Through extensive experiments, we demonstrate that SRC-gAudio effectively generates audio under controlled sampling rates. Additionally, our results indicate that pre-training on low-sampling-rate data can lead to significant improvements in audio quality across various metrics.
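The abstract says the sampling rate is supplied as part of the generation condition but gives no detail on how. As a minimal sketch of one way such a condition could be formed, the snippet below encodes the target sampling rate with a sinusoidal embedding and concatenates it with a text embedding. The function names, the embedding size, the kHz scaling, and the choice of concatenation are all illustrative assumptions, not the authors' design.

```python
import math
from typing import List, Sequence


def sinusoidal_embedding(value: float, dim: int = 8,
                         max_period: float = 10000.0) -> List[float]:
    """Map a scalar to `dim` sin/cos features (hypothetical helper)."""
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    half = dim // 2
    # Geometrically spaced frequencies, as in common timestep embeddings.
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    sines = [math.sin(value * f) for f in freqs]
    cosines = [math.cos(value * f) for f in freqs]
    return sines + cosines


def build_condition(text_embedding: Sequence[float],
                    sampling_rate_hz: int,
                    sr_dim: int = 8) -> List[float]:
    """Concatenate a text embedding with a sampling-rate embedding.

    Assumption: the rate is scaled to kHz before embedding, and the two
    parts are joined by simple concatenation.
    """
    sr_khz = sampling_rate_hz / 1000.0
    return list(text_embedding) + sinusoidal_embedding(sr_khz, sr_dim)


# Placeholder text embedding; a real system would use a text encoder.
text_emb = [0.1, -0.3, 0.7, 0.0]
cond_16k = build_condition(text_emb, 16000)
cond_48k = build_condition(text_emb, 48000)
```

With the same prompt embedding, the two condition vectors differ only in their sampling-rate part, which is the signal a single diffusion model would need in order to switch its output rate.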
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Hearing Loss and Rehabilitation
