Adversarial Deep Metric Learning for Cross-Modal Audio-Text Alignment in Open-Vocabulary Keyword Spotting

Youngmoon Jung; Yong-Hyeok Lee; Myunghun Jung; Jaeyoung Roh; Chang Woo Han; Hoon-Young Cho

arXiv:2505.16735 · eess.AS · May 26, 2025

TL;DR

This paper introduces an adversarial deep metric learning framework for cross-modal audio-text alignment in open-vocabulary keyword spotting. It reduces the heterogeneity between the audio and text modalities and improves phoneme-level alignment.

Contribution

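The paper proposes Modality Adversarial Learning (MAL), in which a modality classifier is trained adversarially so that the acoustic and text encoders produce modality-invariant embeddings. It also applies deep metric learning (DML) to align audio and text at the phoneme level.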
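To make the metric-learning idea concrete, here is a minimal sketch of one common DML objective, a cosine-distance triplet loss between an audio embedding and two text embeddings. The summary does not say which DML objectives the paper compares, so the choice of triplet loss, the margin value, the function names, and the toy 3-dimensional vectors are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def triplet_loss(anchor_audio, positive_text, negative_text, margin=0.2):
    """Hinge loss that is zero once the matching text embedding is closer
    to the audio anchor than the non-matching one by at least `margin`.
    (Illustrative objective; not confirmed to be the paper's choice.)"""
    d_pos = 1.0 - cosine(anchor_audio, positive_text)
    d_neg = 1.0 - cosine(anchor_audio, negative_text)
    return max(0.0, d_pos - d_neg + margin)


# Hypothetical phoneme-level embeddings in a shared space.
audio_k = [0.9, 0.1, 0.0]   # audio frame for phoneme /k/
text_k = [1.0, 0.0, 0.0]    # text embedding for /k/ (match)
text_t = [0.0, 1.0, 0.0]    # text embedding for /t/ (mismatch)

loss_aligned = triplet_loss(audio_k, text_k, text_t)
loss_swapped = triplet_loss(audio_k, text_t, text_k)
```

With well-aligned embeddings the loss vanishes, while swapping the matching and non-matching text embeddings produces a positive loss that training would push down.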

Findings

1. Improved keyword spotting accuracy on the Wall Street Journal (WSJ) and LibriPhrase datasets.

2. Effective reduction of the modality gap through adversarial training.

3. Enhanced phoneme-level alignment between audio and text.

Abstract

For text enrollment-based open-vocabulary keyword spotting (KWS), acoustic and text embeddings are typically compared at either the phoneme or utterance level. To facilitate this, we optimize acoustic and text encoders using deep metric learning (DML), enabling direct comparison of multi-modal embeddings in a shared embedding space. However, the inherent heterogeneity between audio and text modalities presents a significant challenge. To address this, we propose Modality Adversarial Learning (MAL), which reduces the domain gap in heterogeneous modality representations. Specifically, we train a modality classifier adversarially to encourage both encoders to generate modality-invariant embeddings. Additionally, we apply DML to achieve phoneme-level alignment between audio and text, and conduct extensive comparisons across various DML objectives. Experiments on the Wall Street Journal…
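The adversarial step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a logistic-regression modality classifier and a gradient-reversal formulation, in which the classifier descends its cross-entropy loss while the encoders receive the negated gradient. The function name `modality_adversarial_step`, the weight `lam`, and the toy 2-dimensional embeddings are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def modality_adversarial_step(embeddings, labels, w, b, lam=1.0):
    """Binary cross-entropy of a linear modality classifier plus gradients.

    embeddings: list of D-dim lists; labels: 0 = audio, 1 = text.
    Returns (loss, grad_w, grad_b, encoder_grads), where encoder_grads
    holds the REVERSED gradient (-lam * dL/dz) passed to the encoders.
    """
    n = len(embeddings)
    dim = len(w)
    loss = 0.0
    grad_w = [0.0] * dim
    grad_b = 0.0
    encoder_grads = []
    for z, y in zip(embeddings, labels):
        p = sigmoid(sum(wi * zi for wi, zi in zip(w, z)) + b)
        p = min(max(p, 1e-12), 1.0 - 1e-12)  # avoid log(0)
        loss -= (y * math.log(p) + (1 - y) * math.log(1.0 - p)) / n
        err = (p - y) / n  # dL/dlogit for this sample
        for i in range(dim):
            grad_w[i] += err * z[i]
        grad_b += err
        # Gradient reversal (assumed mechanism): encoders get -lam * dL/dz.
        encoder_grads.append([-lam * err * wi for wi in w])
    return loss, grad_w, grad_b, encoder_grads


# Toy example: one audio and one text embedding, easily separable.
embeddings = [[1.0, 0.0], [-1.0, 0.0]]
labels = [0, 1]
w, b = [-1.0, 0.0], 0.0

loss, grad_w, grad_b, encoder_grads = modality_adversarial_step(
    embeddings, labels, w, b)

# One gradient-descent step on the embeddings using the reversed gradient.
moved = [[zi - gi for zi, gi in zip(z, g)]
         for z, g in zip(embeddings, encoder_grads)]
new_loss = modality_adversarial_step(moved, labels, w, b)[0]
```

After the encoder-side step the two embeddings move toward each other and the classifier's loss rises, which is the intended effect: the harder the modalities are to tell apart, the more modality-invariant the embeddings.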
