T-CLAP: Temporal-Enhanced Contrastive Language-Audio Pretraining

Yi Yuan; Zhuo Chen; Xubo Liu; Haohe Liu; Xuenan Xu; Dongya Jia; Yuanzhe Chen; Mark D. Plumbley; Wenwu Wang

arXiv:2404.17806·cs.SD·April 30, 2024

Open Access

TL;DR

T-CLAP adds temporal information to contrastive language-audio pretraining (CLAP) through synthetic temporal-contrastive captions and a new temporal-focused contrastive loss. This improves the model's understanding of the temporal relationships between sound events, and it outperforms existing models on downstream tasks.

Contribution

The paper introduces T-CLAP, a novel temporal-enhanced CLAP model that incorporates synthetic temporal captions and a specialized loss to better capture temporal dynamics in audio-text representations.

Findings

01 T-CLAP outperforms state-of-the-art models in multiple downstream tasks.

02 The model demonstrates improved understanding of temporal relationships in sound events.

03 Synthetic temporal captions effectively enhance the model's temporal feature learning.
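
To make the idea of a temporal-contrastive caption concrete, here is a minimal sketch. It assumes a mix-up in which two single-event clips are concatenated in time and the caption pair differs only in event order. The function names and the "followed by" template are hypothetical and chosen for illustration; they are not taken from the paper.

```python
from typing import List, Sequence, Tuple


def mix_clips(first: Sequence[float], second: Sequence[float]) -> List[float]:
    """Concatenate two waveforms in time so `first` is heard before `second`."""
    return list(first) + list(second)


def make_temporal_pair(first_event: str, second_event: str) -> Tuple[str, str]:
    """Build (positive, negative) captions for a clip where first_event precedes second_event.

    The negative caption names the same events in reversed order, so a model
    can only tell the two apart by using temporal information.
    """
    positive = f"{first_event} followed by {second_event}"
    negative = f"{second_event} followed by {first_event}"
    return positive, negative


# Toy usage with placeholder waveforms (illustrative values only).
mixed = mix_clips([0.1, 0.2], [0.3])
pos, neg = make_temporal_pair("a dog barks", "a car horn honks")
```

The positive and negative captions share every word and differ only in order, which is what makes the negative a hard one for a model that ignores timing.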

Abstract

Contrastive language-audio pretraining (CLAP) has been developed to align the representations of audio and language, achieving remarkable performance in retrieval and classification tasks. However, current CLAP struggles to capture temporal information within audio and text features, presenting substantial limitations for tasks such as audio retrieval and generation. To address this gap, we introduce T-CLAP, a temporal-enhanced CLAP model. We use Large Language Models (LLMs) and mixed-up strategies to generate temporal-contrastive captions for audio clips from extensive audio-text datasets. Subsequently, a new temporal-focused contrastive loss is designed to fine-tune the CLAP model by incorporating these synthetic data. We conduct comprehensive experiments and analysis in multiple downstream tasks. T-CLAP shows improved capability in capturing the temporal relationship of sound events…
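
The abstract does not spell out the temporal-focused contrastive loss. One plausible reading, sketched below as an assumption rather than the authors' formulation, is a standard symmetric contrastive (InfoNCE-style) loss in which each audio clip also sees one extra hard negative: the embedding of its order-swapped caption. The function name, the temperature value, and the single-negative-per-clip design are all hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def normalize(v: Vector) -> List[float]:
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(u: Vector, v: Vector) -> float:
    return sum(x * y for x, y in zip(u, v))


def log_sum_exp(values: Sequence[float]) -> float:
    """Numerically stable log(sum(exp(values)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def temporal_contrastive_loss(
    audio: Sequence[Vector],
    text: Sequence[Vector],
    neg_text: Sequence[Vector],
    temperature: float = 0.07,  # assumed value, not from the paper
) -> float:
    """Symmetric contrastive loss with one temporal hard negative per audio clip.

    audio[i] and text[i] are a matched pair; neg_text[i] is the embedding of
    the order-swapped caption for clip i (an assumption of this sketch).
    """
    a = [normalize(v) for v in audio]
    t = [normalize(v) for v in text]
    n = [normalize(v) for v in neg_text]
    batch = len(a)
    total = 0.0
    for i in range(batch):
        # Audio-to-text: in-batch captions plus this clip's temporal negative.
        a2t = [dot(a[i], t[j]) / temperature for j in range(batch)]
        a2t.append(dot(a[i], n[i]) / temperature)
        # Text-to-audio: plain in-batch contrast.
        t2a = [dot(t[i], a[j]) / temperature for j in range(batch)]
        total += (log_sum_exp(a2t) - a2t[i]) + (log_sum_exp(t2a) - t2a[i])
    return total / (2 * batch)


# Toy check: a negative caption embedded identically to the positive is
# penalised more than one embedded far away from the audio.
emb = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
flipped = [[-x for x in v] for v in emb]
hard = temporal_contrastive_loss(emb, emb, emb)
easy = temporal_contrastive_loss(emb, emb, flipped)
```

In this sketch the loss falls as the text encoder pushes the order-swapped caption away from the audio embedding, which is the behaviour a temporal-focused objective is meant to encourage.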

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Phonetics and Phonology Research

Methods: ALIGN