TACOS: Temporally-aligned Audio CaptiOnS for Language-Audio Pretraining
Paul Primus, Florian Schmid, Gerhard Widmer

TL;DR
This paper introduces TACOS, a dataset of audio captions that are temporally aligned to specific segments of the recordings they describe. Models trained on it associate audio segments with descriptive text more accurately than models trained only on clip-level captions.
Contribution
We curated a new dataset with temporal annotations and developed a frame-wise contrastive training method to improve temporal alignment in language-audio models.
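To make the idea of frame-wise training concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the exact loss is not given on this page, so a sigmoid-style binary cross-entropy over frame-caption similarities is swapped in for illustration. The function names, the temperature value, and the frame hop are all hypothetical.

```python
import math


def onset_offset_to_frame_mask(onset_s, offset_s, num_frames, frame_hop_s):
    """Return 1.0 for frames whose centre lies in [onset_s, offset_s), else 0.0."""
    mask = []
    for t in range(num_frames):
        centre = (t + 0.5) * frame_hop_s
        mask.append(1.0 if onset_s <= centre < offset_s else 0.0)
    return mask


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)


def framewise_loss(frame_embs, caption_embs, masks, temperature=0.1):
    """Mean binary cross-entropy between frame-caption similarity logits
    and the frame-level targets derived from the temporal annotations.

    Assumption: a sigmoid objective stands in for the paper's loss.
    """
    total, count = 0.0, 0
    for caption, mask in zip(caption_embs, masks):
        for frame, target in zip(frame_embs, mask):
            logit = cosine(frame, caption) / temperature
            # Numerically stable binary cross-entropy on a logit.
            total += max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))
            count += 1
    return total / count


# Toy example: 4 frames of 1 s each; the caption describes seconds 1 to 2.
frames = [[0.0, 1.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
caption = [1.0, 0.0]
aligned_mask = onset_offset_to_frame_mask(1.0, 2.0, num_frames=4, frame_hop_s=1.0)
wrong_mask = onset_offset_to_frame_mask(0.0, 1.0, num_frames=4, frame_hop_s=1.0)
loss_aligned = framewise_loss(frames, [caption], [aligned_mask])
loss_wrong = framewise_loss(frames, [caption], [wrong_mask])
```

In this toy case the loss is lower when the annotated segment matches the frame whose embedding resembles the caption, which is the kind of signal a clip-level caption cannot provide.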
Findings
The model shows improved temporal text-audio alignment on the AudioSet benchmark.
The dataset's segment-level annotations provide stronger temporal supervision than clip-level captions.
The proposed method outperforms models trained only on global captions.
Abstract
Learning to associate audio with textual descriptions is valuable for a range of tasks, including pretraining, zero-shot classification, audio retrieval, audio captioning, and text-conditioned audio generation. Existing contrastive language-audio pretrained models are typically trained using global, clip-level descriptions, which provide only weak temporal supervision. We hypothesize that CLAP-like language-audio models, particularly if they are expected to produce frame-level embeddings, can benefit from stronger temporal supervision. To confirm our hypothesis, we curate a novel dataset of approximately 12,000 audio recordings from Freesound, each annotated with single-sentence free-text descriptions linked to a specific temporal segment in an audio recording. We use large language models to clean these annotations by removing references to non-audible events, transcribed speech,…
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Speech Recognition and Synthesis
Methods: ALIGN
