PicoAudio: Enabling Precise Timestamp and Frequency Controllability of Audio Events in Text-to-audio Generation

Zeyu Xie, Xuenan Xu, Zhizheng Wu, and Mengyue Wu

arXiv:2407.02869·cs.SD·July 18, 2024·1 cites

Open Access · 1 Repo · 1 Model · 1 Dataset

TL;DR

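PicoAudio is a text-to-audio generation framework that gives precise control over the timestamps and occurrence frequency of audio events. It integrates temporal information into the model design and trains on fine-grained, temporally aligned audio-text data, and it surpasses current state-of-the-art generation models on both kinds of controllability.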

Contribution

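The paper introduces PicoAudio, a temporally controlled audio generation framework that uses temporal information to guide generation, together with a data pipeline (crawling, segmentation, filtering, and simulation) that produces fine-grained, temporally aligned audio-text data.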

Findings

01

PicoAudio achieves superior controllability over the timestamps and occurrence frequency of audio events.

02

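Both subjective and objective evaluations show that PicoAudio surpasses current state-of-the-art generation models on timestamp and occurrence-frequency controllability.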

03

Generated samples demonstrate practical effectiveness.

Abstract

Recently, audio generation tasks have attracted considerable research interest. Precise temporal controllability is essential for integrating audio generation into real applications. In this work, we propose a temporally controlled audio generation framework, PicoAudio. PicoAudio integrates temporal information to guide audio generation through tailored model design. It leverages data crawling, segmentation, filtering, and simulation to obtain fine-grained, temporally aligned audio-text data. Both subjective and objective evaluations demonstrate that PicoAudio dramatically surpasses current state-of-the-art generation models in terms of timestamp and occurrence frequency controllability. The generated samples are available on the demo website https://zeyuxie29.github.io/PicoAudio.github.io.
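
To make "timestamp" and "occurrence frequency" control concrete, here is a minimal sketch of one way such temporal information could be encoded before it is handed to a generator. It is an illustration only, not the paper's implementation: the function names `timestamp_matrix` and `occurrence_counts`, the 10-second clip length, the 10 frames-per-second resolution, and the example event labels are all assumptions introduced here.

```python
from typing import Dict, List, Tuple

# Hypothetical input format: event label -> list of (onset, offset) in seconds.
Segments = Dict[str, List[Tuple[float, float]]]


def timestamp_matrix(
    events: Segments, duration: float = 10.0, frame_rate: int = 10
) -> Tuple[List[str], List[List[int]]]:
    """Build a binary (event x frame) matrix marking when each event is active.

    `duration` and `frame_rate` are illustrative defaults, not values taken
    from the paper.
    """
    n_frames = round(duration * frame_rate)
    labels = sorted(events)
    matrix = [[0] * n_frames for _ in labels]
    for row, label in enumerate(labels):
        for onset, offset in events[label]:
            # Clamp the segment to the clip boundaries.
            start = max(0, round(onset * frame_rate))
            end = min(n_frames, round(offset * frame_rate))
            for frame in range(start, end):
                matrix[row][frame] = 1
    return labels, matrix


def occurrence_counts(events: Segments) -> Dict[str, int]:
    """Occurrence frequency: how many separate times each event happens."""
    return {label: len(segments) for label, segments in events.items()}


# Example request: a dog barks twice and a door slams once.
events = {
    "dog barking": [(0.5, 1.2), (3.0, 3.8)],
    "door slam": [(2.0, 2.3)],
}
labels, matrix = timestamp_matrix(events)
counts = occurrence_counts(events)
```

In this sketch, timestamp control corresponds to which frames are set in each row, and frequency control corresponds to the number of separate segments per event. A frame-level matrix is only one possible encoding, chosen here because it is easy to inspect.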

Code & Models

Repositories

picoaudio/picoaudio
PyTorch · Official

Models

ZeyuXie/PicoAudio
model

Datasets

amphion/PicoAudio
dataset · 18 downloads

Taxonomy

Topics: Music and Audio Processing · Speech Recognition and Synthesis · Music Technology and Sound Studies