Cross-modal Knowledge Transfer Learning as Graph Matching Based on Optimal Transport for ASR
Xugang Lu, Peng Shen, Yu Tsao, Hisashi Kawai

TL;DR
This paper introduces GM-OT, a graph matching optimal transport method that aligns linguistic and acoustic features structurally, improving knowledge transfer in end-to-end speech recognition.
Contribution
We propose GM-OT, a novel structured graph matching optimal transport framework that models and aligns linguistic and acoustic sequences for enhanced ASR performance.
Findings
Significant accuracy improvements on Mandarin ASR tasks.
Effective structured alignment of linguistic and acoustic features.
Theoretical unification of prior OT-based methods within GM-OT.
Abstract
Transferring linguistic knowledge from a pretrained language model (PLM) to acoustic feature learning has proven effective in enhancing end-to-end automatic speech recognition (E2E-ASR). However, aligning representations between linguistic and acoustic modalities remains a challenge due to inherent modality gaps. Optimal transport (OT) has shown promise in mitigating these gaps by minimizing the Wasserstein distance (WD) between linguistic and acoustic feature distributions. Previous OT-based methods, however, overlook structural relationships and treat feature vectors as unordered sets. To address this, we propose Graph Matching Optimal Transport (GM-OT), which models linguistic and acoustic sequences as structured graphs. Nodes represent feature embeddings, while edges capture temporal and sequential relationships. GM-OT minimizes both WD (between nodes) and Gromov-Wasserstein distance…
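The two quantities named in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation, and everything specific in it is an assumption made for illustration: the function names, the entropic Sinkhorn solver, the use of squared Euclidean distances as node costs and as edge weights, and the weight `lam`. It also assumes both sequences have already been projected into a shared feature dimension. For simplicity the coupling is computed from the node cost alone, and the Gromov-Wasserstein term is then evaluated at that coupling rather than optimized jointly.

```python
import numpy as np


def pairwise_sq_dists(x, y):
    """Squared Euclidean distances between rows of x (n, d) and y (m, d)."""
    diff = x[:, None, :] - y[None, :, :]
    return np.sum(diff ** 2, axis=-1)


def sinkhorn(cost, a, b, eps=0.1, n_iters=500):
    """Entropic OT coupling via Sinkhorn iterations (illustrative solver)."""
    K = np.exp(-cost / eps)
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]


def gw_discrepancy(Cx, Cy, T):
    """Evaluate sum_{i,j,k,l} (Cx[i,k] - Cy[j,l])^2 * T[i,j] * T[k,l]."""
    p = T.sum(axis=1)  # row marginal of the coupling
    q = T.sum(axis=0)  # column marginal of the coupling
    term_x = (Cx ** 2 @ p) @ p
    term_y = (Cy ** 2 @ q) @ q
    cross = np.sum((Cx @ T @ Cy.T) * T)
    return term_x + term_y - 2.0 * cross


def gm_ot_loss_sketch(ling, acou, lam=0.5):
    """Hypothetical combined loss: node-level WD term plus lam * GW term.

    ling: (n, d) linguistic embeddings, acou: (m, d) acoustic embeddings.
    Returns (wd, gw, T), where T is the (n, m) coupling.
    """
    n, m = ling.shape[0], acou.shape[0]
    a = np.full(n, 1.0 / n)  # uniform mass on linguistic nodes (assumption)
    b = np.full(m, 1.0 / m)  # uniform mass on acoustic nodes (assumption)

    # Node cost across modalities, scaled to [0, 1] for numerical stability.
    cost = pairwise_sq_dists(ling, acou)
    cost = cost / cost.max()
    T = sinkhorn(cost, a, b)
    wd = np.sum(T * cost)

    # Edge weights within each sequence (assumed: intra-sequence distances).
    Cx = pairwise_sq_dists(ling, ling)
    Cy = pairwise_sq_dists(acou, acou)
    gw = gw_discrepancy(Cx / Cx.max(), Cy / Cy.max(), T)
    return wd, gw, T


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    ling = rng.normal(size=(5, 8))    # 5 token embeddings
    acou = rng.normal(size=(12, 8))   # 12 acoustic frames
    wd, gw, T = gm_ot_loss_sketch(ling, acou)
    print("node term:", wd, "structure term:", gw, "combined:", wd + 0.5 * gw)
```

The node term penalizes transporting mass between dissimilar embeddings, while the structure term penalizes couplings that map a pair of nearby linguistic nodes onto a pair of distant acoustic nodes, or the reverse. A set-based OT loss sees only the first term.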
Taxonomy
Topics: Speech Recognition and Synthesis · Voice and Speech Disorders · Speech and Audio Processing
