TL;DR
This paper introduces a new benchmark and evaluation metrics for modeling relations between audio events in text-to-audio generation, along with a finetuning framework to improve existing models.
Contribution
It systematically studies audio event relation modeling, creates comprehensive datasets, and proposes a finetuning method to enhance TTA models' relational understanding.
Findings
Established a relation corpus covering the potential relations between audio events in real-world scenarios
Created a new audio event corpus of commonly heard sounds
Proposed evaluation metrics for audio event relation modeling
Abstract
Although Text-to-Audio (TTA) generation models have advanced significantly, achieving high-fidelity audio with fine-grained context understanding, they struggle to model the relations between audio events described in the input text. Previous TTA methods have neither systematically explored audio event relation modeling nor proposed frameworks to enhance this capability. In this work, we systematically study audio event relation modeling in TTA generation models. We first establish a benchmark for this task by: 1. proposing a comprehensive relation corpus covering all potential relations in real-world scenarios; 2. introducing a new audio event corpus encompassing commonly heard sounds; and 3. proposing new evaluation metrics to assess audio event relation modeling from various perspectives. Furthermore, we propose a finetuning framework to enhance existing TTA models…
