TACOS: Temporally-aligned Audio CaptiOnS for Language-Audio Pretraining

Paul Primus; Florian Schmid; Gerhard Widmer

arXiv:2505.07609·eess.AS·May 13, 2025

Open Access

TL;DR

This paper introduces TACOS, a dataset of audio recordings whose captions are linked to specific temporal segments. Training on these temporally aligned captions improves a model's ability to associate individual audio segments with descriptive text.

Contribution

We curated a new dataset with temporal annotations and developed a frame-wise contrastive training method to improve temporal alignment in language-audio models.
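
As a rough illustration of what frame-wise contrastive training on temporally aligned captions could look like, the NumPy sketch below turns caption segments into per-frame targets and scores frame and text embeddings against them. Everything in it is an assumption made for illustration and is not taken from the paper: the function names, the sigmoid (binary cross-entropy) form of the loss, the temperature value, and the frame hop are all hypothetical.

```python
import numpy as np


def segments_to_frame_targets(segments, num_frames, hop_s):
    """Build a (num_frames, num_captions) 0/1 matrix from caption segments.

    segments: list of (onset_s, offset_s) pairs, one per caption (hypothetical format).
    A frame counts as active for a caption if its center falls inside the segment.
    """
    frame_centers = (np.arange(num_frames) + 0.5) * hop_s
    targets = np.zeros((num_frames, len(segments)), dtype=np.float64)
    for k, (onset, offset) in enumerate(segments):
        targets[:, k] = (frame_centers >= onset) & (frame_centers < offset)
    return targets


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)


def frame_wise_loss(frame_emb, text_emb, targets, temperature=0.1):
    """Mean binary cross-entropy over all (frame, caption) pairs.

    frame_emb: (num_frames, dim), text_emb: (num_captions, dim).
    This sigmoid formulation is an assumed stand-in, not the paper's loss.
    """
    logits = l2_normalize(frame_emb) @ l2_normalize(text_emb).T / temperature
    # Numerically stable binary cross-entropy with logits.
    bce = np.maximum(logits, 0) - logits * targets + np.log1p(np.exp(-np.abs(logits)))
    return float(bce.mean())


# Toy clip: 10 frames of 1 s each, two captions covering 0-4 s and 6-10 s.
targets = segments_to_frame_targets([(0.0, 4.0), (6.0, 10.0)], num_frames=10, hop_s=1.0)

text_emb = np.array([[1.0, 0.0], [0.0, 1.0]])
frame_emb = np.array([[1.0, 0.0]] * 4 + [[-1.0, -1.0]] * 2 + [[0.0, 1.0]] * 4)

aligned_loss = frame_wise_loss(frame_emb, text_emb, targets)
# Swapping the captions breaks the temporal correspondence.
misaligned_loss = frame_wise_loss(frame_emb, text_emb[::-1], targets)
print(aligned_loss, misaligned_loss)
```

In this toy setup the loss is lower when each caption's embedding matches the frames inside its annotated segment. A clip-level caption alone could not provide that per-frame signal, because it gives no onset or offset to build the targets from.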

Findings

01 Enhanced temporal text-audio alignment demonstrated on AudioSet benchmark.

02 Our dataset enables more precise training for temporal audio-text tasks.

03 The proposed method outperforms models trained only on global captions.

Abstract

Learning to associate audio with textual descriptions is valuable for a range of tasks, including pretraining, zero-shot classification, audio retrieval, audio captioning, and text-conditioned audio generation. Existing contrastive language-audio pretrained models are typically trained using global, clip-level descriptions, which provide only weak temporal supervision. We hypothesize that CLAP-like language-audio models, particularly those expected to produce frame-level embeddings, can benefit from stronger temporal supervision. To confirm our hypothesis, we curate a novel dataset of approximately 12,000 audio recordings from Freesound, each annotated with single-sentence free-text descriptions that are linked to specific temporal segments of the recording. We use large language models to clean these annotations by removing references to non-audible events, transcribed speech,…

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Speech Recognition and Synthesis

Methods: ALIGN