RiTTA: Modeling Event Relations in Text-to-Audio Generation

Yuhang He; Yash Jain; Xubo Liu; Andrew Markham; Vibhav Vineet

arXiv:2412.15922 · cs.LG · April 10, 2026

TL;DR

This paper introduces a new benchmark and evaluation metrics for modeling relations between audio events in text-to-audio generation, along with a finetuning framework to improve existing models.

Contribution

The paper systematically studies audio event relation modeling in text-to-audio (TTA) generation, builds a relation corpus and an audio event corpus for benchmarking, and proposes a finetuning framework to improve existing TTA models' handling of event relations.

Findings

01 Established a relation corpus covering real-world scenarios

02 Created a new audio event corpus with common sounds

03 Proposed evaluation metrics for audio event relation modeling

Abstract

Although Text-to-Audio (TTA) generation models have advanced significantly, achieving high-fidelity audio with fine-grained context understanding, they still struggle to model the relations between audio events described in the input text. Previous TTA methods have neither systematically explored audio event relation modeling nor proposed frameworks to enhance this capability. In this work, we systematically study audio event relation modeling in TTA generation models. We first establish a benchmark for this task by: 1. proposing a comprehensive relation corpus covering all potential relations in real-world scenarios; 2. introducing a new audio event corpus encompassing commonly heard sounds; and 3. proposing new evaluation metrics to assess audio event relation modeling from various perspectives. Furthermore, we propose a finetuning framework to enhance existing TTA models…

Code & Models

Repositories

yuhanghe01/RiTTA (GitHub)

Videos

RiTTA: Modeling Event Relations in Text-to-Audio Generation