CALM: Joint Contextual Acoustic-Linguistic Modeling for Personalization of Multi-Speaker ASR

Muhammad Shakeel; Yosuke Fukumoto; Chikara Maeda; Chyi-Jiunn Lin; Shinji Watanabe

arXiv:2601.22792·eess.AS·May 14, 2026

TL;DR

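CALM is a joint acoustic-linguistic modeling framework for personalized multi-speaker ASR. It integrates speaker-embedding-driven target-speaker extraction with dynamic vocabulary-based contextual biasing in a single end-to-end system, and lowers biased error rates on two-speaker English and Japanese mixtures.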

Contribution

CALM is the first to combine target-speaker extraction with contextual biasing in a unified end-to-end multi-speaker ASR model.

Findings

01. CALM reduces biased word error rate (B-WER) from 12.7 to 4.7 on LibriSpeech2Mix.

02. CALM reduces biased character error rate (B-CER) from 16.6 to 8.4 on CSJMix2 (eval3).

03. The gains hold for both English (LibriSpeechMix) and Japanese (CSJMix) two-speaker mixtures.
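
To put the two headline numbers on a common scale, the short sketch below computes the relative error reduction implied by the reported values. The helper name `relative_reduction` is hypothetical and not from the paper; the four input numbers are the ones quoted in the findings.

```python
def relative_reduction(baseline: float, improved: float) -> float:
    """Relative reduction in percent: how much of the baseline error was removed."""
    return 100.0 * (baseline - improved) / baseline


# Values quoted in the findings (baseline -> CALM).
bwer_reduction = relative_reduction(12.7, 4.7)   # LibriSpeech2Mix, B-WER
bcer_reduction = relative_reduction(16.6, 8.4)   # CSJMix2 (eval3), B-CER

print(f"B-WER relative reduction: {bwer_reduction:.1f}%")  # 63.0%
print(f"B-CER relative reduction: {bcer_reduction:.1f}%")  # 49.4%
```

In relative terms, the reported B-WER drops by roughly 63% and the reported B-CER by roughly 49%.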

Abstract

We present CALM, a joint Contextual Acoustic-Linguistic Modeling framework for multi-speaker automatic speech recognition (ASR). In personalized AI scenarios, the joint availability of acoustic and linguistic cues naturally motivates the integration of target-speaker conditioning with contextual biasing in overlapping conversations. CALM implements this integration in an end-to-end framework through speaker embedding-driven target-speaker extraction and dynamic vocabulary-based contextual biasing. We evaluate CALM on simulated English (LibriSpeechMix) and Japanese (Corpus of Spontaneous Japanese mixtures, CSJMix). On two-speaker mixtures, CALM reduces biased word error rate (B-WER) from 12.7 to 4.7 on LibriSpeech2Mix and biased character error rate (B-CER) from 16.6 to 8.4 on CSJMix2 (eval3), demonstrating the effectiveness of joint acoustic-linguistic modeling across languages. We…
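
The abstract reports results as biased word error rate (B-WER) without spelling out how it is scored. The sketch below shows one common convention, assumed here and not confirmed by the source: align reference and hypothesis with edit distance, then count only the errors that touch words in the biasing list (substitutions and deletions of reference bias words, plus insertions of bias words), normalized by the number of bias words in the reference. The function names `align` and `biased_wer`, the word-level (rather than phrase-level) matching, and the toy sentences are all illustrative assumptions. Passing character lists instead of word lists would give a B-CER analogue.

```python
from typing import List, Set, Tuple


def align(ref: List[str], hyp: List[str]) -> List[Tuple[str, str, str]]:
    """Levenshtein alignment. Returns (op, ref_token, hyp_token) triples,
    where op is one of "ok", "sub", "del", "ins"."""
    n, m = len(ref), len(hyp)
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            cost[i][j] = min(diag, cost[i - 1][j] + 1, cost[i][j - 1] + 1)

    # Backtrace from the bottom-right corner to recover the operations.
    ops: List[Tuple[str, str, str]] = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            op = "ok" if ref[i - 1] == hyp[j - 1] else "sub"
            ops.append((op, ref[i - 1], hyp[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            ops.append(("del", ref[i - 1], ""))
            i -= 1
        else:
            ops.append(("ins", "", hyp[j - 1]))
            j -= 1
    ops.reverse()
    return ops


def biased_wer(ref: List[str], hyp: List[str], bias_words: Set[str]) -> float:
    """B-WER in percent under the assumed convention described above."""
    total = sum(1 for w in ref if w in bias_words)
    errors = 0
    for op, r, h in align(ref, hyp):
        if op in ("sub", "del") and r in bias_words:
            errors += 1
        elif op == "ins" and h in bias_words:
            errors += 1
    return 100.0 * errors / total if total else 0.0


# Toy example (invented sentences, not data from the paper).
ref = "call doctor shakeel about the calm paper".split()
hyp = "call doctor shakil about calm paper".split()
bias = {"shakeel", "calm"}
print(biased_wer(ref, hyp, bias))  # 50.0: one of two bias words is misrecognized
```

In the toy example the deleted word "the" is not in the biasing list, so it does not count toward B-WER; it would count toward the ordinary or unbiased error rate instead.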
