PMF-CEC: Phoneme-augmented Multimodal Fusion for Context-aware ASR Error Correction with Error-specific Selective Decoding

Jiajun He; Tomoki Toda

arXiv:2506.11064·eess.AS·June 16, 2025

Open Access

TL;DR

This paper introduces PMF-CEC, a phoneme-augmented multimodal fusion approach to context-aware ASR error correction. It distinguishes target rare words from their homophones more reliably than its predecessor ED-CEC, mitigates overdetection in the error detection module, and keeps inference fast.

Contribution

The paper proposes PMF-CEC, which builds on ED-CEC to improve rare-word correction in ASR by integrating phoneme information and adding a retention mechanism aimed at better error detection.

Findings

01. PMF-CEC reduces word error rate more effectively than ED-CEC.

02. The method outperforms other biasing techniques in correcting homophones.

03. PMF-CEC maintains fast inference speed and robustness against large biasing lists.

Abstract

End-to-end automatic speech recognition (ASR) models often struggle to recognize rare words accurately. Previously, we introduced an ASR postprocessing method called error detection and context-aware error correction (ED-CEC), which leverages contextual information such as named entities and technical terms to improve the accuracy of ASR transcripts. Although ED-CEC achieves notable success in correcting rare words, its accuracy remains low for rare words that have similar pronunciations but different spellings. To address this issue, we propose a phoneme-augmented multimodal fusion method for context-aware error correction (PMF-CEC), built on ED-CEC, which allows better differentiation between target rare words and homophones. Additionally, we observed that the previous ASR error detection module suffers from overdetection. To mitigate this, we…
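
To make the homophone problem concrete, here is a toy Python sketch of why a phoneme view can help when choosing a correction from a biasing list. It is not the PMF-CEC model: the paper describes a learned multimodal fusion, whereas this sketch substitutes a plain weighted sum of character-level and phoneme-level edit-distance similarities. The names `BIASING_LIST`, `rank_candidates`, and `alpha`, along with the hand-written ARPAbet-style pronunciations, are assumptions made purely for illustration.

```python
from typing import Dict, List, Sequence, Tuple


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two sequences (characters or phonemes)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def similarity(a: Sequence[str], b: Sequence[str]) -> float:
    """Edit distance normalized to [0, 1]; 1.0 means identical sequences."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def rank_candidates(
    word: str,
    word_phonemes: List[str],
    biasing_list: Dict[str, List[str]],
    alpha: float = 0.5,
) -> List[Tuple[str, float]]:
    """Rank biasing-list entries by a weighted text + phoneme similarity.

    alpha = 1.0 uses spelling only; alpha = 0.0 uses pronunciation only.
    """
    scored = []
    for candidate, candidate_phonemes in biasing_list.items():
        text_sim = similarity(word, candidate)
        phone_sim = similarity(word_phonemes, candidate_phonemes)
        scored.append((candidate, alpha * text_sim + (1 - alpha) * phone_sim))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical biasing list: word -> hand-written pronunciation (assumption).
BIASING_LIST: Dict[str, List[str]] = {
    "fishier": ["F", "IH", "SH", "IY", "ER"],
    "fischer": ["F", "IH", "SH", "ER"],
}

if __name__ == "__main__":
    # Suppose the ASR hypothesis contains "fisher" where a rare name was spoken.
    hypothesis_word = "fisher"
    hypothesis_phonemes = ["F", "IH", "SH", "ER"]
    for name, score in rank_candidates(hypothesis_word, hypothesis_phonemes, BIASING_LIST):
        print(f"{name}: {score:.3f}")
```

In this example both candidates are one character edit away from "fisher", so spelling alone cannot separate them; the phoneme similarity breaks the tie in favor of the entry that is pronounced identically. A real system would obtain pronunciations from a lexicon or a grapheme-to-phoneme model rather than from hand-written lists.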

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Speech and Dialogue Systems