Speak in Context: Multilingual ASR with Speech Context Alignment via Contrastive Learning
Yuchen Zhang, Haralambos Mouratidis, Ravi Shekhar

TL;DR
This paper presents a multilingual ASR system that effectively integrates speech context using contrastive learning, enhancing recognition accuracy across diverse languages and dialects while maintaining modularity and flexibility.
Contribution
It introduces a novel contrastive learning approach for aligning speech and contextual representations, enabling improved multilingual and context-aware speech recognition.
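As a rough illustration of what contrastive alignment between speech and context representations can look like, here is a minimal sketch of a symmetric InfoNCE-style loss in NumPy. This summary does not specify the paper's exact objective, so the loss form, the `temperature` value, the pooled per-utterance embeddings, and every function name below are assumptions made for illustration only.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def symmetric_info_nce(speech_emb, context_emb, temperature=0.07):
    """Hypothetical symmetric contrastive loss.

    Row i of speech_emb and row i of context_emb are treated as a positive
    pair; every other row in the batch serves as a negative.
    """
    s = l2_normalize(speech_emb)
    c = l2_normalize(context_emb)
    logits = s @ c.T / temperature            # (batch, batch) similarity matrix
    idx = np.arange(logits.shape[0])
    loss_s2c = -log_softmax(logits, axis=1)[idx, idx].mean()  # speech -> context
    loss_c2s = -log_softmax(logits, axis=0)[idx, idx].mean()  # context -> speech
    return 0.5 * (loss_s2c + loss_c2s)

# Toy data: "aligned" speech embeddings sit close to their paired context
# embeddings; reversing the batch order breaks every pairing.
rng = np.random.default_rng(0)
context = rng.normal(size=(4, 16))
aligned = context + 0.01 * rng.normal(size=(4, 16))
mismatched = aligned[::-1]

loss_aligned = symmetric_info_nce(aligned, context)
loss_mismatched = symmetric_info_nce(mismatched, context)
print(loss_aligned < loss_mismatched)
```

In this toy setup the loss is lower when each speech embedding is paired with its own context embedding than when the pairs are mismatched, which is the behaviour a contrastive alignment objective is meant to reward.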
Findings
Contextual input improves recognition accuracy by over 5%.
Contrastive alignment enhances performance across different context types.
The system supports 11 languages and 5 English dialects, with consistent gains.
Abstract
Automatic speech recognition (ASR) has benefited from advances in pretrained speech and language models, yet most systems remain constrained to monolingual settings and short, isolated utterances. While recent efforts in context-aware ASR show promise, two key challenges persist: limited multilingual support and the absence of principled alignment between speech and contextual representations. In this paper, we introduce a context-aware multilingual ASR framework that supports diverse languages and accents while preserving the modularity of pretrained models. Our approach combines a frozen speech encoder and a decoder-only language model via a lightweight projection module, allowing structured context prompts, including dialogue history and biasing words, to guide transcription. To improve interaction between speech and context, we employ a contrastive learning objective that aligns…
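The architecture described in the abstract (frozen speech encoder, lightweight projection module, decoder-only language model, context prompt) can be sketched at the level of tensor shapes. The abstract does not say how the projection is built, so the frame stacking, the single linear layer, the dimensions, and all names below are assumptions; random arrays stand in for the frozen encoder output and the embedded context prompt.

```python
import numpy as np

def stack_frames(features, k):
    """Shorten the sequence by concatenating k consecutive frames.

    Leftover frames that do not fill a full group are dropped.
    """
    t, d = features.shape
    t_trim = (t // k) * k
    return features[:t_trim].reshape(t_trim // k, k * d)

def project(features, weight, bias):
    """Linear map from stacked encoder features into the LM embedding space."""
    return features @ weight + bias

def build_lm_input(context_emb, speech_emb):
    """Prepend embedded context (e.g. dialogue history, biasing words) to speech."""
    return np.concatenate([context_emb, speech_emb], axis=0)

rng = np.random.default_rng(0)
enc_dim, lm_dim, k = 8, 12, 4                    # illustrative sizes only

encoder_out = rng.normal(size=(50, enc_dim))     # stand-in for frozen encoder output
context_emb = rng.normal(size=(6, lm_dim))       # stand-in for embedded prompt tokens

# Only these projection parameters would be trained in this sketch;
# the encoder and the language model are assumed to stay frozen.
W = rng.normal(size=(k * enc_dim, lm_dim)) * 0.02
b = np.zeros(lm_dim)

speech_tokens = project(stack_frames(encoder_out, k), W, b)
lm_input = build_lm_input(context_emb, speech_tokens)
print(lm_input.shape)  # (18, 12): 6 context positions + 12 projected speech positions
```

Keeping the trainable part to a small projection is one way to preserve the modularity the abstract emphasises, since the pretrained encoder and language model can be swapped without retraining them.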
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and dialogue systems · Emotion and Mood Recognition
