Joint Multimodal Contrastive Learning for Robust Spoken Term Detection and Keyword Spotting

Ramesh Gundluru; Shubham Gupta; Sri Rama Murty K

arXiv:2512.14115 · cs.SD · December 17, 2025

TL;DR

This paper introduces a joint multimodal contrastive learning framework that unifies acoustic and cross-modal supervision to improve spoken term detection and keyword spotting, outperforming existing acoustic word embedding baselines on word discrimination.

Contribution

Presents the first comprehensive joint contrastive learning approach that combines audio-text and audio-audio alignment to learn acoustic word embeddings (AWEs) for speech retrieval tasks.

Findings

01 Outperforms existing AWE baselines on word discrimination tasks
02 Supports both STD and KWS with improved robustness
03 Unifies multimodal supervision in a shared embedding space

Abstract

Acoustic Word Embeddings (AWEs) improve the efficiency of speech retrieval tasks such as Spoken Term Detection (STD) and Keyword Spotting (KWS). However, existing approaches suffer from several limitations: unimodal supervision, disjoint optimization of audio-audio and audio-text alignment, and the need for task-specific models. To address these shortcomings, we propose a joint multimodal contrastive learning framework that unifies acoustic and cross-modal supervision in a shared embedding space. Our approach simultaneously optimizes (i) audio-text contrastive learning, inspired by the CLAP loss, to align audio and text representations, and (ii) audio-audio contrastive learning, via the Deep Word Discrimination (DWD) loss, to enhance intra-class compactness and inter-class separation. The proposed method outperforms existing AWE baselines on the word discrimination task while flexibly…
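The two objectives named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the exact form of the DWD loss or how the two terms are combined, so the audio-audio term below is a generic supervised contrastive loss used as a stand-in, and the function names, the `weight` mixing parameter, and the `temperature` value are all assumptions made for illustration. Pure Python is used so the example runs without any dependencies.

```python
import math


def l2_normalize(v):
    # Scale a vector to unit length (zero vectors are left unchanged).
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def cosine_matrix(a, b):
    # Pairwise cosine similarity between the rows of a and the rows of b.
    a = [l2_normalize(v) for v in a]
    b = [l2_normalize(v) for v in b]
    return [[sum(x * y for x, y in zip(u, w)) for w in b] for u in a]


def log_sum_exp(xs):
    # Numerically stable log(sum(exp(x))).
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def audio_text_loss(audio, text, temperature=0.07):
    # CLAP/CLIP-style symmetric cross-entropy: row i of `audio` is assumed
    # to be paired with row i of `text`; every other row is a negative.
    # Simplification: repeated words in a batch are still treated as negatives.
    sim = cosine_matrix(audio, text)
    n = len(sim)
    logits = [[s / temperature for s in row] for row in sim]
    cols = [[logits[i][j] for i in range(n)] for j in range(n)]
    a2t = sum(log_sum_exp(logits[i]) - logits[i][i] for i in range(n)) / n
    t2a = sum(log_sum_exp(cols[j]) - cols[j][j] for j in range(n)) / n
    return 0.5 * (a2t + t2a)


def audio_audio_loss(audio, labels, temperature=0.07):
    # Stand-in for the DWD loss (assumption): a supervised contrastive loss
    # that pulls together utterances of the same word and pushes apart
    # utterances of different words.
    sim = cosine_matrix(audio, audio)
    n = len(audio)
    total, anchors = 0.0, 0
    for i in range(n):
        others = [j for j in range(n) if j != i]
        positives = [j for j in others if labels[j] == labels[i]]
        if not positives:
            continue  # no same-word partner in the batch for this anchor
        denom = log_sum_exp([sim[i][j] / temperature for j in others])
        total += sum(denom - sim[i][j] / temperature for j in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


def joint_loss(audio, text, labels, weight=0.5, temperature=0.07):
    # Hypothetical combination: a convex mix of the two terms. The source
    # does not state the actual weighting scheme.
    return (weight * audio_text_loss(audio, text, temperature)
            + (1.0 - weight) * audio_audio_loss(audio, labels, temperature))
```

Here `audio` and `text` are lists of equal-length embedding vectors produced by whatever encoders are in use, and `labels` holds one word identity per audio embedding. Optimizing both terms in a single step is what distinguishes a joint objective from the disjoint optimization of audio-audio and audio-text alignment that the abstract lists as a limitation.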

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Speech and dialogue systems