Text-to-Audio Grounding: Building Correspondence Between Captions and Sound Events

Xuenan Xu; Heinrich Dinkel; Mengyue Wu; Kai Yu

arXiv:2102.11474·cs.SD·February 24, 2021·1 cites

TL;DR

This paper introduces the Text-to-Audio Grounding task and dataset, aiming to link captions with specific sound events in audio clips, advancing cross-modal audio understanding.

Contribution

It presents a new dataset and task for grounding captions in audio, bridging audio processing and language understanding.

Findings

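01 Built the AudioGrounding dataset, which links sound events to the AudioCaps captions that mention them and gives the timestamps of each event.

02 Proposed a baseline approach that reaches an event-F1 score of 28.3%.

03 The same baseline reaches a Polyphonic Sound Detection Score (PSDS) of 14.7%.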

Abstract

Automated Audio Captioning is a cross-modal task that generates natural language descriptions summarizing the sound events in an audio clip. However, grounding the actual sound events in a given audio clip based on its corresponding caption has not been investigated. This paper contributes the AudioGrounding dataset, which provides the correspondence between sound events and the captions provided in AudioCaps, along with the location (timestamps) of each present sound event. Based on this dataset, we propose the text-to-audio grounding (TAG) task, which jointly considers audio processing and language understanding. A baseline approach is provided, resulting in an event-F1 score of 28.3% and a Polyphonic Sound Detection Score (PSDS) of 14.7%.
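To make the event-F1 figure more concrete, here is a minimal sketch of how an event-based F1 score can be computed from predicted and reference (onset, offset) timestamps for a single caption phrase. The function name `event_f1`, the 200 ms onset collar, the duration-proportional offset tolerance, and the greedy matching are assumptions chosen for illustration; the abstract does not state the evaluation settings the authors used, and this is not their evaluation code.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (onset, offset) in seconds


def event_f1(reference: List[Segment], predicted: List[Segment],
             collar: float = 0.2, offset_ratio: float = 0.2) -> float:
    """Event-based F1 for one phrase, using greedy one-to-one matching.

    collar and offset_ratio are illustrative defaults (assumptions),
    not values taken from the paper.
    """
    matched = set()  # indices of reference segments already matched
    true_pos = 0
    for p_on, p_off in predicted:
        for i, (r_on, r_off) in enumerate(reference):
            if i in matched:
                continue
            # Offset tolerance grows with the reference event's duration.
            off_tol = max(collar, offset_ratio * (r_off - r_on))
            if abs(p_on - r_on) <= collar and abs(p_off - r_off) <= off_tol:
                matched.add(i)
                true_pos += 1
                break
    false_pos = len(predicted) - true_pos
    false_neg = len(reference) - true_pos
    denom = 2 * true_pos + false_pos + false_neg
    # With no reference and no predicted events, report 0.0 by convention here.
    return 2 * true_pos / denom if denom else 0.0


if __name__ == "__main__":
    # Hypothetical timestamps for a phrase such as "a dog barks".
    reference = [(0.0, 2.0), (5.0, 6.0)]
    predicted = [(0.1, 2.1), (7.0, 8.0)]
    print(event_f1(reference, predicted))  # one hit, one miss, one false alarm
```

A predicted segment counts as correct only if both its onset and offset fall within tolerance of an unmatched reference segment, so the metric penalizes imprecise boundaries as well as missed or spurious events.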

Code & Models

Repositories

wsntxxn/TextToAudioGrounding
PyTorch · Official

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Speech Recognition and Synthesis