Dissecting Temporal Understanding in Text-to-Audio Retrieval

Andreea-Maria Oncescu; João F. Henriques; A. Sophia Koepke

arXiv:2409.00851·cs.IR·September 4, 2024

Open Access

TL;DR

This paper investigates the temporal understanding abilities of text-to-audio retrieval models, introduces a synthetic dataset for controlled evaluation, and proposes a loss function to enhance temporal comprehension.

Contribution

The paper analyses temporal reasoning in text-to-audio retrieval models, introduces a new synthetic text-audio dataset for controlled evaluation, and proposes a loss function to improve temporal understanding.

Findings

01. Models have limited temporal understanding in text-to-audio retrieval.

02. Synthetic dataset enables controlled evaluation of temporal capabilities.

03. Proposed loss function improves temporal ordering performance.

Abstract

Recent advancements in machine learning have fueled research on multimodal tasks such as text-to-video and text-to-audio retrieval. These tasks require models to understand the semantic content of video and audio data, including objects and characters. The models also need to learn spatial arrangements and temporal relationships. In this work, we analyse the temporal ordering of sounds, which is an understudied problem in the context of text-to-audio retrieval. In particular, we dissect the temporal understanding capabilities of a state-of-the-art model for text-to-audio retrieval on the AudioCaps and Clotho datasets. Additionally, we introduce a synthetic text-audio dataset that provides a controlled setting for evaluating the temporal capabilities of recent models. Lastly, we present a loss function that encourages text-audio models to focus on the temporal ordering of…

Taxonomy

Topics: Music and Audio Processing · Natural Language Processing Techniques · Speech Recognition and Synthesis

Methods: Focus