Mel-Refine: A Plug-and-Play Approach to Refine Mel-Spectrogram in Audio Generation
Hongming Guo, Ruibo Fu, Yizhong Geng, Shuai Liu, Shuchen Shi, Tao Wang, Chunyu Qiang, Chenxing Li, Ya Li, Zhengqi Wen, Yukun Liu, Xuefei Liu

TL;DR
Mel-Refine is a plug-and-play method that improves the detail and texture of Mel-spectrograms in text-to-audio models without retraining, significantly enhancing audio quality.
Contribution
The paper introduces Mel-Refine, a novel inference-time technique that refines Mel-spectrograms in diffusion-based TTA models, requiring no additional training.
Findings
Boosts Tango2 model performance by 25%
Enhances Mel-spectrogram detail and texture
Compatible with any diffusion-based TTA architecture
Abstract
Text-to-audio (TTA) models are capable of generating diverse audio from textual prompts. However, most mainstream TTA models, which predominantly rely on Mel-spectrograms, still face challenges in producing audio with rich content. The intricate details and texture required in Mel-spectrograms for such audio often surpass the models' capacity, leading to outputs that are blurred or lack coherence. In this paper, we begin by investigating the critical role of U-Net in Mel-spectrogram generation. Our analysis shows that in the U-Net structure, high-frequency components in the skip-connections and the backbone influence texture and detail, while low-frequency components in the backbone are critical for the diffusion denoising process. We further propose "Mel-Refine", a plug-and-play approach that enhances Mel-spectrogram texture and detail by adjusting different component weights during inference.…
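The idea of reweighting frequency components of a feature map at inference time can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say how the components are separated or which weights are used. The sketch assumes a 2-D Fourier transform with a square low-frequency window, and the names `scale_frequency_bands`, `cutoff`, `low_scale`, and `high_scale` are hypothetical.

```python
import numpy as np


def scale_frequency_bands(feature, cutoff, low_scale=1.0, high_scale=1.0):
    """Reweight low- and high-frequency content of a 2-D feature map.

    Assumption: "low frequency" means the square window of half-width
    `cutoff` around the DC term of the centered 2-D spectrum. The paper
    may define the split differently.
    """
    h, w = feature.shape
    # Move the zero-frequency (DC) term to the center of the spectrum.
    spectrum = np.fft.fftshift(np.fft.fft2(feature))
    cy, cx = h // 2, w // 2

    # Start with the high-frequency weight everywhere, then overwrite
    # the centered low-frequency window with its own weight.
    weights = np.full((h, w), high_scale, dtype=float)
    weights[max(cy - cutoff, 0):cy + cutoff + 1,
            max(cx - cutoff, 0):cx + cutoff + 1] = low_scale

    # Undo the shift and return to the spatial domain. The weights are
    # symmetric about DC, so the imaginary part is only rounding noise.
    return np.fft.ifft2(np.fft.ifftshift(spectrum * weights)).real


# Example: amplify high frequencies of a stand-in feature map by 1.5x.
rng = np.random.default_rng(0)
feature_map = rng.standard_normal((8, 16))
refined = scale_frequency_bands(feature_map, cutoff=2, high_scale=1.5)
```

In a diffusion U-Net, a function like this would be applied per channel to skip-connection or backbone features during sampling. Which features to scale, and by how much, is the part the paper addresses, and the values above are placeholders.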
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing
