Dissecting Temporal Understanding in Text-to-Audio Retrieval
Andreea-Maria Oncescu, João F. Henriques, A. Sophia Koepke

TL;DR
This paper investigates the temporal understanding abilities of text-to-audio retrieval models, introduces a synthetic dataset for controlled evaluation, and proposes a loss function to enhance temporal comprehension.
Contribution
It provides a detailed analysis of temporal reasoning in text-to-audio models, introduces a new synthetic dataset, and proposes a loss function to improve temporal understanding.
Findings
Models have limited temporal understanding in text-to-audio retrieval.
Synthetic dataset enables controlled evaluation of temporal capabilities.
Proposed loss function improves temporal ordering performance.
Abstract
Recent advancements in machine learning have fueled research on multimodal tasks such as text-to-video and text-to-audio retrieval. These tasks require models to understand the semantic content of video and audio data, including objects and characters. The models also need to learn spatial arrangements and temporal relationships. In this work, we analyse the temporal ordering of sounds, which is an understudied problem in the context of text-to-audio retrieval. In particular, we dissect the temporal understanding capabilities of a state-of-the-art model for text-to-audio retrieval on the AudioCaps and Clotho datasets. Additionally, we introduce a synthetic text-audio dataset that provides a controlled setting for evaluating temporal capabilities of recent models. Lastly, we present a loss function that encourages text-audio models to focus on the temporal ordering of…
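The abstract does not spell out the loss function, so the following is only a rough sketch of one way a temporal-ordering objective could be set up, not the authors' formulation. It assumes a caption whose order is reversed by swapping "before" and "after", and a hinge penalty that asks the model to score the correct caption above the reversed one. The names `reverse_temporal_caption`, `temporal_hinge_loss`, and the `margin` value are hypothetical.

```python
import re

# Hypothetical mapping: flipping these connectives reverses the described order.
_SWAP = {"before": "after", "after": "before"}


def reverse_temporal_caption(caption: str) -> str:
    """Swap whole-word 'before' and 'after' (matched case-insensitively)."""
    return re.sub(
        r"\b(before|after)\b",
        lambda m: _SWAP[m.group(1).lower()],
        caption,
        flags=re.IGNORECASE,
    )


def temporal_hinge_loss(sim_correct: float, sim_reversed: float,
                        margin: float = 0.2) -> float:
    """Illustrative hinge penalty (an assumption, not the paper's loss).

    sim_correct:  audio-text similarity for the original caption.
    sim_reversed: audio-text similarity for the order-reversed caption.
    The loss is zero once the correct caption wins by at least `margin`.
    """
    return max(0.0, margin - sim_correct + sim_reversed)


if __name__ == "__main__":
    original = "a dog barks before a car honks"
    print(reverse_temporal_caption(original))
    # A model that ignores order gives both captions the same score,
    # so it pays the full margin as a penalty.
    print(temporal_hinge_loss(0.5, 0.5))
```

A model that is insensitive to ordering assigns similar scores to both captions and is penalised, which is the kind of behaviour a controlled synthetic dataset could expose.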
Taxonomy
Topics: Music and Audio Processing · Natural Language Processing Techniques · Speech Recognition and Synthesis
