Speak in Context: Multilingual ASR with Speech Context Alignment via Contrastive Learning

Yuchen Zhang; Haralambos Mouratidis; Ravi Shekhar

arXiv:2603.06505·cs.CL·March 9, 2026

TL;DR

This paper presents a multilingual ASR system that effectively integrates speech context using contrastive learning, enhancing recognition accuracy across diverse languages and dialects while maintaining modularity and flexibility.

Contribution

It introduces a novel contrastive learning approach for aligning speech and contextual representations, enabling improved multilingual and context-aware speech recognition.

Findings

01. Contextual input improves recognition accuracy by over 5%.

02. Contrastive alignment enhances performance across different context types.

03. System supports 11 languages and 5 English dialects with consistent gains.

Abstract

Automatic speech recognition (ASR) has benefited from advances in pretrained speech and language models, yet most systems remain constrained to monolingual settings and short, isolated utterances. While recent efforts in context-aware ASR show promise, two key challenges persist: limited multilingual support and the absence of principled alignment between speech and contextual representations. In this paper, we introduce a context-aware multilingual ASR framework that supports diverse languages and accents while preserving the modularity of pretrained models. Our approach combines a frozen speech encoder and a decoder-only language model via a lightweight projection module, allowing structured context prompts, including dialogue history and biasing words, to guide transcription. To improve interaction between speech and context, we employ a contrastive learning objective that aligns…
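
The abstract is cut off before it says what the contrastive objective aligns or how it is formulated, so the following is only a generic illustration of the idea, not the authors' method. It sketches a symmetric InfoNCE-style loss between pooled speech embeddings and context embeddings, assuming each speech clip in a batch is paired with exactly one context prompt. The function names, the mean-pooling step, the linear projection, and the temperature value are all hypothetical choices made for this example.

```python
import numpy as np

def project_speech(frames, weight, bias):
    """Hypothetical lightweight projection: mean-pool encoder frames, then a linear map.

    frames: (batch, time, d_speech) outputs of a frozen speech encoder (assumed shape).
    weight: (d_speech, d_model), bias: (d_model,).
    """
    pooled = frames.mean(axis=1)          # (batch, d_speech)
    return pooled @ weight + bias         # (batch, d_model)

def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def contrastive_alignment_loss(speech_emb, context_emb, temperature=0.07):
    """Symmetric InfoNCE: row i of speech_emb is the positive for row i of context_emb.

    All other rows in the batch act as negatives. The temperature is an
    assumed default, not a value taken from the paper.
    """
    s = l2_normalize(speech_emb)
    c = l2_normalize(context_emb)
    logits = s @ c.T / temperature        # (batch, batch) similarity matrix
    idx = np.arange(len(s))
    loss_s2c = -log_softmax(logits, axis=1)[idx, idx].mean()  # speech -> context
    loss_c2s = -log_softmax(logits, axis=0)[idx, idx].mean()  # context -> speech
    return 0.5 * (loss_s2c + loss_c2s)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frames = rng.normal(size=(4, 50, 16))             # stand-in for frozen encoder output
    weight = rng.normal(size=(16, 8))
    bias = np.zeros(8)
    speech = project_speech(frames, weight, bias)
    context = speech + 0.1 * rng.normal(size=speech.shape)  # toy "matching" contexts
    print(contrastive_alignment_loss(speech, context))
```

In a setup like the one the abstract describes, only the projection parameters (here `weight` and `bias`) would receive gradients from such a loss, since the speech encoder is stated to be frozen; this sketch computes the loss value only and leaves training to an autodiff framework.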

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and dialogue systems · Emotion and Mood Recognition