Cross-modal Knowledge Transfer Learning as Graph Matching Based on Optimal Transport for ASR

Xugang Lu; Peng Shen; Yu Tsao; Hisashi Kawai

arXiv:2505.13079·eess.AS·May 20, 2025

TL;DR

This paper introduces GM-OT, a graph matching optimal transport method that aligns linguistic and acoustic features structurally, improving knowledge transfer in end-to-end speech recognition.

Contribution

We propose GM-OT, a novel structured graph matching optimal transport framework that models and aligns linguistic and acoustic sequences for enhanced ASR performance.

Findings

01. Significant accuracy improvements on Mandarin ASR tasks.

02. Effective structured alignment of linguistic and acoustic features.

03. Theoretical unification of prior OT-based methods within GM-OT.

Abstract

Transferring linguistic knowledge from a pretrained language model (PLM) to acoustic feature learning has proven effective in enhancing end-to-end automatic speech recognition (E2E-ASR). However, aligning representations between linguistic and acoustic modalities remains a challenge due to inherent modality gaps. Optimal transport (OT) has shown promise in mitigating these gaps by minimizing the Wasserstein distance (WD) between linguistic and acoustic feature distributions. However, previous OT-based methods overlook structural relationships, treating feature vectors as unordered sets. To address this, we propose Graph Matching Optimal Transport (GM-OT), which models linguistic and acoustic sequences as structured graphs. Nodes represent feature embeddings, while edges capture temporal and sequential relationships. GM-OT minimizes both WD (between nodes) and Gromov-Wasserstein distance…
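To make the node-versus-edge distinction in the abstract concrete, here is a minimal NumPy sketch, not the authors' implementation. It assumes cosine distance for both node and edge costs, uniform node weights, an entropic (Sinkhorn) coupling computed from the node cost alone, and a weighted sum of the two terms with a hypothetical weight `alpha`. The function names are illustrative. The paper's actual edge definition (temporal and sequential relationships) and its optimization procedure are not given in the truncated abstract and may differ.

```python
import numpy as np

def cosine_distance(x, y):
    """Pairwise cosine distance between rows of x (n, d) and rows of y (m, d)."""
    xn = x / np.linalg.norm(x, axis=1, keepdims=True)
    yn = y / np.linalg.norm(y, axis=1, keepdims=True)
    return 1.0 - xn @ yn.T

def sinkhorn(cost, a, b, eps=0.1, n_iters=200):
    """Entropic OT coupling with marginals a and b (plain Sinkhorn iterations)."""
    K = np.exp(-cost / eps)
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]

def gw_discrepancy(A, B, P):
    """Squared-loss Gromov-Wasserstein objective for a fixed coupling P:
    sum_{i,j,k,l} (A[i,k] - B[j,l])^2 * P[i,j] * P[k,l],
    expanded algebraically so no 4-D tensor is built."""
    p = P.sum(axis=1)  # marginal over graph-1 nodes
    q = P.sum(axis=0)  # marginal over graph-2 nodes
    term_a = p @ (A ** 2) @ p
    term_b = q @ (B ** 2) @ q
    cross = np.sum((A @ P @ B.T) * P)
    return term_a + term_b - 2.0 * cross

def gm_ot_loss(acoustic, linguistic, alpha=0.5):
    """Illustrative combined loss: node-level transport cost plus an
    edge-level GW term. alpha is a hypothetical trade-off weight."""
    n, m = len(acoustic), len(linguistic)
    a = np.full(n, 1.0 / n)  # assumption: uniform weight per acoustic frame
    b = np.full(m, 1.0 / m)  # assumption: uniform weight per token
    C = cosine_distance(acoustic, linguistic)    # node-to-node cost
    A = cosine_distance(acoustic, acoustic)      # edges inside the acoustic graph
    B = cosine_distance(linguistic, linguistic)  # edges inside the linguistic graph
    P = sinkhorn(C, a, b)  # simplification: coupling ignores the edge term
    wd = np.sum(P * C)
    gw = gw_discrepancy(A, B, P)
    return (1.0 - alpha) * wd + alpha * gw, P

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    acoustic = rng.normal(size=(12, 8))   # 12 acoustic frames, 8-dim embeddings
    linguistic = rng.normal(size=(5, 8))  # 5 token embeddings from a PLM
    loss, P = gm_ot_loss(acoustic, linguistic)
    print(loss, P.shape)
```

Setting `alpha=0` reduces the sketch to a plain node-level OT loss, which treats the features as an unordered set. The edge term is what penalizes couplings that distort pairwise structure within each sequence.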

Taxonomy

Topics: Speech Recognition and Synthesis · Voice and Speech Disorders · Speech and Audio Processing