CosyAudio: Improving Audio Generation with Confidence Scores and Synthetic Captions

Xinfa Zhu; Wenjie Tian; Xinsheng Wang; Lei He; Xi Wang; Sheng Zhao; Lei Xie

arXiv:2501.16761·eess.AS·January 29, 2025

Open Access

TL;DR

CosyAudio introduces a confidence score-based framework with synthetic captions to improve text-to-audio generation, addressing data scarcity and noisy labels, and demonstrating superior performance and generalization.

Contribution

It proposes a novel confidence-aware framework with synthetic captions and a self-evolving training strategy for robust audio generation from text.

Findings

01

Outperforms existing models in automated audio captioning

02

Generates more faithful and higher-quality audio

03

Shows strong generalization across diverse datasets

Abstract

Text-to-Audio (TTA) generation is an emerging area within AI-generated content (AIGC), where audio is created from natural language descriptions. Despite growing interest, developing robust TTA models remains challenging due to the scarcity of well-labeled datasets and the prevalence of noisy or inaccurate captions in large-scale, weakly labeled corpora. To address these challenges, we propose CosyAudio, a novel framework that utilizes confidence scores and synthetic captions to enhance the quality of audio generation. CosyAudio consists of two core components: AudioCapTeller and an audio generator. AudioCapTeller generates synthetic captions for audio and provides confidence scores to evaluate their accuracy. The audio generator uses these synthetic captions and confidence scores to enable quality-aware audio generation. Additionally, we introduce a self-evolving training strategy that…
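
The abstract says the audio generator consumes both synthetic captions and their confidence scores. As a rough illustration only, the sketch below shows one way such a pairing could be represented as conditioning records. It is not the authors' implementation: the `CaptionedClip` record, the `confidence_bin` and `build_conditioning` helpers, the assumption that scores lie in [0, 1], and the choice to quantize scores into five bins are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Union


@dataclass
class CaptionedClip:
    """Hypothetical record: one audio clip with a synthetic caption and its score."""
    audio_id: str      # illustrative identifier, not from the paper
    caption: str       # synthetic caption text
    confidence: float  # ASSUMPTION: score in [0, 1], higher = caption fits audio better


def confidence_bin(score: float, num_bins: int = 5) -> int:
    """Map a score in [0, 1] to an integer bin in [0, num_bins - 1]."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("confidence score must lie in [0, 1]")
    # A score of exactly 1.0 would land in bin num_bins, so clamp it to the top bin.
    return min(int(score * num_bins), num_bins - 1)


def build_conditioning(
    clips: List[CaptionedClip], num_bins: int = 5
) -> List[Dict[str, Union[str, int]]]:
    """Pair each caption with its confidence bin as a conditioning record."""
    return [
        {
            "audio_id": clip.audio_id,
            "text": clip.caption,
            "conf_bin": confidence_bin(clip.confidence, num_bins),
        }
        for clip in clips
    ]


# Made-up example data, for illustration only.
clips = [
    CaptionedClip("clip_a", "a dog barks twice", 0.92),
    CaptionedClip("clip_b", "music plays", 0.31),
]
records = build_conditioning(clips)
```

Under these assumptions, a generator trained on such records would see the caption and a coarse quality signal together, and requesting the top bin at inference time would be one plausible way to ask for higher-fidelity output. Whether CosyAudio conditions on discrete bins or on continuous scores is not stated in the abstract.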

Taxonomy

Topics: Speech and Audio Processing · Hearing Loss and Rehabilitation · Subtitles and Audiovisual Media