Enhancing Audio-Language Models through Self-Supervised Post-Training with Text-Audio Pairs
Anshuman Sinha, Camille Migozzi, Aubin Rey, Chao Zhang

TL;DR
This paper introduces TeminAL, a two-stage training method that enhances audio-language models with temporal understanding, improving their performance on temporal tasks while maintaining their zero-shot capabilities.
Contribution
The paper proposes TeminAL, a novel two-stage training scheme that instills temporal understanding in contrastive audio-language models without sacrificing their existing capabilities.
Findings
Achieved an average 5.28% performance gain in temporal understanding on ESC-50.
Maintained competitive zero-shot retrieval and classification performance.
Proposed ZSTE, a new evaluation strategy for zero-shot contrastive models.
Abstract
Research on multi-modal contrastive learning strategies for audio and text has rapidly gained interest. Contrastively trained Audio-Language Models (ALMs), such as CLAP, which establish a unified representation across audio and language modalities, have improved performance on various downstream tasks by providing good text-aligned audio encoders and vice versa. These improvements are evident in areas like zero-shot audio classification and audio retrieval, among others. However, the ability of these models to understand natural language and temporal relations remains largely unexplored and an open field for research. In this paper, we propose to equip multi-modal ALMs with temporal understanding, without losing their inherent prior capabilities on audio-language tasks, using a temporal instillation method, TeminAL. We implement a two-stage training scheme TeminAL A B, where the…
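To make the contrastive setup mentioned in the abstract concrete, here is a minimal sketch of how a CLAP-style model could be used for zero-shot classification, together with a symmetric contrastive loss over audio-text pairs. It is an illustration under stated assumptions, not the TeminAL method: the function names, the toy embeddings, and the fixed temperature of 0.07 are all hypothetical, and real models produce the embeddings with learned audio and text encoders.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def zero_shot_classify(audio_emb, label_embs):
    """Pick the label whose text embedding is closest to the audio embedding."""
    scores = {label: cosine(audio_emb, emb) for label, emb in label_embs.items()}
    return max(scores, key=scores.get), scores

def _mean_diagonal_cross_entropy(logits):
    """Mean cross-entropy where row i's correct class is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)  # subtract the max for numerical stability
        log_z = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_z - row[i]
    return total / len(logits)

def symmetric_contrastive_loss(audio_embs, text_embs, temperature=0.07):
    """Average of audio-to-text and text-to-audio cross-entropy.

    Pair i of audio_embs and text_embs is assumed to be a matching pair.
    The temperature value is an assumption chosen for illustration.
    """
    logits = [[cosine(a, t) / temperature for t in text_embs] for a in audio_embs]
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (_mean_diagonal_cross_entropy(logits)
                  + _mean_diagonal_cross_entropy(transposed))

# Toy embeddings (hypothetical values, not model outputs).
audio = [1.0, 0.1, 0.0]
labels = {"dog bark": [0.9, 0.2, 0.0], "rain": [0.0, 0.1, 1.0]}
predicted, _ = zero_shot_classify(audio, labels)

aligned = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                     [[1.0, 0.0], [0.0, 1.0]])
swapped = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                     [[0.0, 1.0], [1.0, 0.0]])
```

In this toy setup, the loss is low when each audio embedding lines up with its own caption and high when the captions are swapped. A loss of this form depends only on which caption matches which clip, which is consistent with the abstract's point that temporal relations remain largely unexplored for such models.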
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing
Methods: Contrastive Learning
