MMGER: Multi-modal and Multi-granularity Generative Error Correction with LLM for Joint Accent and Speech Recognition

Bingshen Mu; Yangze Li; Qijie Shao; Kun Wei; Xucheng Wan; Naijun Zheng; Huan Zhou; Lei Xie

arXiv:2405.03152 · eess.AS · May 8, 2024

TL;DR

This paper introduces MMGER, a multi-modal, multi-granularity generative error correction model that uses LLMs to improve multi-accent speech recognition by combining fine-grained and coarse-grained correction.

Contribution

The paper proposes a unified generative error correction (GER) model for joint automatic speech recognition (ASR) and accent recognition (AR). It integrates multi-modal and multi-granularity correction for multi-accent scenarios to improve error correction performance.

Findings

1. Achieved a 26.72% relative improvement in accent recognition accuracy.
2. Achieved a 27.55% relative reduction in character error rate.
3. Demonstrated effectiveness on a multi-accent Mandarin dataset.
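For readers unfamiliar with the metrics behind these findings, the following is a minimal Python sketch of how a character error rate (CER) and a relative reduction are commonly computed. It assumes CER is the character-level Levenshtein distance divided by the reference length; the paper's actual scoring tool is not stated on this page. The function names and the example values are hypothetical and are not taken from the paper.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))  # distance from empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance normalized by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def relative_reduction(baseline: float, improved: float) -> float:
    """Relative reduction of an error metric with respect to a baseline."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (baseline - improved) / baseline


# Illustrative values only (not results from the paper).
example_cer = cer("今天天气很好", "今天天汽很好")   # one substituted character out of six
example_gain = relative_reduction(10.0, 7.5)      # baseline 10.0% CER, improved 7.5% CER
```

A relative figure is always measured against a baseline system, so a 27.55% relative CER reduction does not mean the absolute CER dropped by 27.55 percentage points.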

Abstract

Despite notable advancements in automatic speech recognition (ASR), performance tends to degrade when faced with adverse conditions. Generative error correction (GER) leverages the exceptional text comprehension capabilities of large language models (LLMs), delivering impressive performance in ASR error correction, where N-best hypotheses provide valuable information for transcription prediction. However, GER encounters challenges such as fixed N-best hypotheses, insufficient utilization of acoustic information, and limited specificity to multi-accent scenarios. In this paper, we explore the application of GER in multi-accent scenarios. Accents represent deviations from standard pronunciation norms, and the multi-task learning framework for simultaneous ASR and accent recognition (AR) has effectively addressed multi-accent scenarios, making it a prominent solution. In this work, we…

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Speech and dialogue systems

Methods: Graph Convolutional Network · Gait Emotion Recognition