T-CLAP: Temporal-Enhanced Contrastive Language-Audio Pretraining
Yi Yuan, Zhuo Chen, Xubo Liu, Haohe Liu, Xuenan Xu, Dongya Jia, Yuanzhe Chen, Mark D. Plumbley, Wenwu Wang

TL;DR
T-CLAP enhances contrastive language-audio pretraining by integrating temporal information through synthetic captions and a new loss, which improves its understanding of temporal relationships between sound events and lets it outperform existing models.
Contribution
The paper introduces T-CLAP, a novel temporal-enhanced CLAP model that incorporates synthetic temporal captions and a specialized loss to better capture temporal dynamics in audio-text representations.
Findings
T-CLAP outperforms state-of-the-art models in multiple downstream tasks.
The model demonstrates improved understanding of temporal relationships in sound events.
Synthetic temporal captions effectively enhance the model's temporal feature learning.
Abstract
Contrastive language-audio pretraining (CLAP) has been developed to align the representations of audio and language, achieving remarkable performance in retrieval and classification tasks. However, current CLAP struggles to capture temporal information within audio and text features, presenting substantial limitations for tasks such as audio retrieval and generation. To address this gap, we introduce T-CLAP, a temporal-enhanced CLAP model. We use Large Language Models (LLMs) and mixed-up strategies to generate temporal-contrastive captions for audio clips from extensive audio-text datasets. Subsequently, a new temporal-focused contrastive loss is designed to fine-tune the CLAP model by incorporating these synthetic data. We conduct comprehensive experiments and analysis in multiple downstream tasks. T-CLAP shows improved capability in capturing the temporal relationship of sound events…
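The abstract does not spell out the loss or the caption-generation procedure, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes a standard symmetric CLAP-style contrastive loss and adds one extra hard negative per audio clip: a caption whose event order has been swapped. The helper names (`swap_temporal_order`, `contrastive_loss_with_temporal_negatives`), the " followed by " connective, and the temperature value are all hypothetical choices made for illustration.

```python
import numpy as np

def swap_temporal_order(caption: str, connective: str = " followed by ") -> str:
    """Hypothetical helper: reverse the two events around a temporal connective."""
    first, sep, second = caption.partition(connective)
    if not sep:
        return caption  # no connective found, so there is nothing to swap
    return f"{second}{connective}{first}"

def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def log_softmax(logits: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable log-softmax."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def contrastive_loss_with_temporal_negatives(
    audio: np.ndarray,      # (B, D) audio embeddings
    text: np.ndarray,       # (B, D) embeddings of the matching captions
    text_neg: np.ndarray,   # (B, D) embeddings of the order-swapped captions
    temperature: float = 0.07,  # assumed value, not taken from the paper
) -> float:
    a, t, n = l2_normalize(audio), l2_normalize(text), l2_normalize(text_neg)
    sim = a @ t.T / temperature                                   # (B, B) in-batch logits
    neg = np.sum(a * n, axis=-1, keepdims=True) / temperature     # (B, 1) hard-negative logit
    idx = np.arange(len(a))
    # Audio-to-text: each clip must pick its caption over the other captions
    # in the batch AND over its own temporally swapped caption.
    a2t = -log_softmax(np.concatenate([sim, neg], axis=1), axis=1)[idx, idx].mean()
    # Text-to-audio: the usual in-batch contrastive term.
    t2a = -log_softmax(sim.T, axis=1)[idx, idx].mean()
    return float(0.5 * (a2t + t2a))
```

In this sketch the swapped caption only enters the audio-to-text direction; whether the actual T-CLAP loss treats temporal negatives this way is not stated in the abstract.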
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Phonetics and Phonology Research
Methods: ALIGN
