AudioTime: A Temporally-aligned Audio-text Benchmark Dataset

Zeyu Xie, Xuenan Xu, Zhizheng Wu, and Mengyue Wu

arXiv:2407.02857·cs.SD·July 4, 2024

TL;DR

AudioTime is a new dataset that provides high-quality, temporally-aligned audio-text annotations to improve models' ability to understand and control the timing of sound events from textual descriptions.

Contribution

The paper introduces AudioTime, a dataset with detailed temporal annotations, and evaluation tools to enhance temporal controllability in audio generation models.

Findings

01 Dataset covers comprehensive temporal aspects of audio.

02 Provides benchmarks for temporal control performance.

03 Enables training of models with improved temporal accuracy.

Abstract

Recent advancements in audio generation have enabled the creation of high-fidelity audio clips from free-form textual descriptions. However, temporal relationships, a critical feature for audio content, are currently underrepresented in mainstream models, resulting in imprecise temporal controllability. Specifically, users cannot accurately control the timestamps of sound events using free-form text. We acknowledge that a significant factor is the absence of high-quality, temporally-aligned audio-text datasets, which are essential for training models with temporal control. The more temporally-aligned the annotations, the better the models can understand the precise relationship between audio outputs and temporal textual prompts. Therefore, we present a strongly aligned audio-text dataset, AudioTime. It provides text annotations rich in temporal information such as timestamps,…

Taxonomy

Topics: Music and Audio Processing · Speech Recognition and Synthesis