MMGER: Multi-modal and Multi-granularity Generative Error Correction with LLM for Joint Accent and Speech Recognition
Bingshen Mu, Yangze Li, Qijie Shao, Kun Wei, Xucheng Wan, Naijun Zheng, Huan Zhou, Lei Xie

TL;DR
This paper introduces MMGER, a multi-modal, multi-granularity generative error correction model that uses LLMs to improve multi-accent speech recognition by combining fine-grained and coarse-grained correction.
Contribution
The paper proposes a novel unified ASR-AR GER model that integrates multi-modal and multi-granularity correction for multi-accent scenarios, enhancing error correction performance.
Findings
Achieved 26.72% relative improvement in accent recognition accuracy.
Reduced ASR character error rate (CER) by a relative 27.55%.
Demonstrated effectiveness on a multi-accent Mandarin dataset.
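For readers unfamiliar with the metric, the following is a minimal sketch of how a character error rate and a relative reduction between two systems are commonly computed. It is not the authors' evaluation code; the function names (edit_distance, cer, relative_reduction) are hypothetical, and the sketch assumes the standard definition of CER as character-level edit distance divided by reference length.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two character sequences."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # character of ref missing from hyp
                curr[j - 1] + 1,     # extra character in hyp
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference transcript must not be empty")
    return edit_distance(ref, hyp) / len(ref)


def relative_reduction(baseline: float, system: float) -> float:
    """Relative reduction of an error rate with respect to a baseline."""
    if baseline <= 0:
        raise ValueError("baseline error rate must be positive")
    return (baseline - system) / baseline
```

As an illustration with made-up numbers (not taken from the paper), a baseline CER of 10.0% and a system CER of 8.0% give relative_reduction(10.0, 8.0) = 0.2, that is, a 20% relative reduction even though the absolute difference is only 2 points.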
Abstract
Despite notable advancements in automatic speech recognition (ASR), performance tends to degrade when faced with adverse conditions. Generative error correction (GER) leverages the exceptional text comprehension capabilities of large language models (LLM), delivering impressive performance in ASR error correction, where N-best hypotheses provide valuable information for transcription prediction. However, GER encounters challenges such as fixed N-best hypotheses, insufficient utilization of acoustic information, and limited specificity to multi-accent scenarios. In this paper, we explore the application of GER in multi-accent scenarios. Accents represent deviations from standard pronunciation norms, and the multi-task learning framework for simultaneous ASR and accent recognition (AR) has effectively addressed the multi-accent scenarios, making it a prominent solution. In this work, we…
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Speech and dialogue systems
Methods: Graph Convolutional Network · Gait Emotion Recognition
