Enhancing Audio-Language Models through Self-Supervised Post-Training with Text-Audio Pairs

Anshuman Sinha; Camille Migozzi; Aubin Rey; Chao Zhang

arXiv:2408.09269 · cs.SD · April 22, 2025

TL;DR

This paper introduces TeminAL, a two-stage training method that enhances audio-language models with temporal understanding, improving their performance on temporal tasks while maintaining their zero-shot capabilities.

Contribution

The paper proposes TeminAL, a novel two-stage training scheme that instills temporal understanding in contrastive audio-language models without sacrificing their existing capabilities.

Findings

1. Achieved an average 5.28% performance gain in temporal understanding on ESC-50.

2. Maintained competitive zero-shot retrieval and classification performance.

3. Proposed ZSTE, a new evaluation strategy for zero-shot contrastive models.

Abstract

Research on multi-modal contrastive learning strategies for audio and text has rapidly gained interest. Contrastively trained Audio-Language Models (ALMs) such as CLAP, which establish a unified representation across the audio and language modalities, have improved performance on various downstream tasks by providing good text-aligned audio encoders and vice versa. These improvements are evident in areas such as zero-shot audio classification and audio retrieval, among others. However, the ability of these models to understand natural language and temporal relations remains a largely unexplored and open field of research. In this paper, we propose to equip multi-modal ALMs with temporal understanding, without losing their inherent prior capabilities on audio-language tasks, using a temporal instillation method, TeminAL. We implement a two-stage training scheme, TeminAL A & B, where the…

Code & Models

Repositories

Camille112/T-CLAP
PyTorch · Official

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing

Methods: Contrastive Learning