Mel-Refine: A Plug-and-Play Approach to Refine Mel-Spectrogram in Audio Generation

Hongming Guo; Ruibo Fu; Yizhong Geng; Shuai Liu; Shuchen Shi; Tao Wang; Chunyu Qiang; Chenxing Li; Ya Li; Zhengqi Wen; Yukun Liu; Xuefei Liu

arXiv:2412.08577·cs.SD·December 12, 2024

Open Access

TL;DR

Mel-Refine is a plug-and-play method that improves the detail and texture of Mel-spectrograms in text-to-audio models without retraining, significantly enhancing audio quality.

Contribution

The paper introduces Mel-Refine, a novel inference-time technique that refines Mel-spectrograms in diffusion-based TTA models, requiring no additional training.

Findings

01 Boosts Tango2 model performance by 25%

02 Enhances Mel-spectrogram detail and texture

03 Compatible with any diffusion-based TTA architecture

Abstract

Text-to-audio (TTA) models are capable of generating diverse audio from textual prompts. However, most mainstream TTA models, which predominantly rely on Mel-spectrograms, still face challenges in producing audio with rich content. The intricate details and texture required in Mel-spectrograms for such audio often surpass the models' capacity, leading to outputs that are blurred or lack coherence. In this paper, we begin by investigating the critical role of U-Net in Mel-spectrogram generation. Our analysis shows that in the U-Net structure, high-frequency components in the skip-connections and the backbone influence texture and detail, while low-frequency components in the backbone are critical for the diffusion denoising process. We further propose "Mel-Refine", a plug-and-play approach that enhances Mel-spectrogram texture and detail by adjusting different component weights during inference.…
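The abstract does not say how the frequency components are separated or which weights are applied, so the following is only a rough sketch of the general idea of reweighting low- and high-frequency content in a feature. It assumes a Fourier-domain split on a 1-D feature row; the function name `reweight_frequencies`, the `cutoff` threshold, and the `low_scale` / `high_scale` factors are all hypothetical and are not taken from the paper.

```python
import cmath


def dft(x):
    """Naive discrete Fourier transform of a list of numbers."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Naive inverse discrete Fourier transform."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]


def reweight_frequencies(feature, cutoff, low_scale, high_scale):
    """Scale low- and high-frequency components of a 1-D feature separately.

    Hypothetical illustration: bins whose distance from DC is <= cutoff are
    treated as "low frequency", all others as "high frequency".
    """
    n = len(feature)
    spectrum = dft(feature)
    scaled = []
    for k, coeff in enumerate(spectrum):
        # Bins k and n - k are conjugate pairs for a real signal, so they
        # must receive the same scale to keep the output real.
        freq = min(k, n - k)
        scale = low_scale if freq <= cutoff else high_scale
        scaled.append(coeff * scale)
    return [value.real for value in idft(scaled)]


# Example: amplify fine detail (high frequencies) in a toy feature row
# while leaving its slowly varying part unchanged.
skip_feature = [1.0, -1.0, 1.0, -1.0]
refined = reweight_frequencies(skip_feature, cutoff=1, low_scale=1.0, high_scale=2.0)
```

In a diffusion U-Net, an adjustment of this kind would be applied to skip-connection or backbone feature maps at inference time, which is why no retraining is involved; the actual components and weights used by Mel-Refine are described in the full paper, not here.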

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing