Benchmarking Japanese Speech Recognition on ASR-LLM Setups with Multi-Pass Augmented Generative Error Correction

Yuka Ko; Sheng Li; Chao-Han Huck Yang; Tatsuya Kawahara

arXiv:2408.16180 · eess.AS · October 14, 2024

TL;DR

This paper introduces a new benchmark and multi-pass generative error correction method using large language models to improve Japanese speech recognition accuracy, demonstrating significant performance gains.

Contribution

It presents the first Japanese GER benchmark and a novel multi-pass correction technique that leverages multiple hypotheses and LLMs for enhanced ASR performance.

Findings

1. Performance improvements in ASR quality on Japanese datasets
2. Effective integration of multiple hypotheses with LLM corrections
3. Demonstrated generalization across different Japanese speech datasets

Abstract

With the strong representational power of large language models (LLMs), generative error correction (GER) for automatic speech recognition (ASR) aims to provide semantic and phonetic refinements to address ASR errors. This work explores how LLM-based GER can enhance and expand the capabilities of Japanese language processing, presenting the first GER benchmark for Japanese ASR with 0.9-2.6k text utterances. We also introduce a new multi-pass augmented generative error correction (MPA GER) by integrating multiple system hypotheses on the input side with corrections from multiple LLMs on the output side and then merging them. To the best of our knowledge, this is the first investigation of the use of LLMs for Japanese GER, which involves second-pass language modeling on the output transcriptions generated by the ASR system (e.g., N-best hypotheses). Our experiments demonstrated…
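The multi-pass flow described in the abstract (several ASR systems feeding N-best hypotheses in, several LLM correctors producing corrections out, then a merge) can be sketched roughly as follows. This is only an illustrative sketch, not the authors' implementation: the abstract does not state how the outputs are merged, so the majority vote used here is an assumption, and `multi_pass_correct`, `top1`, and `most_frequent` are hypothetical names. The two toy correctors stand in for real LLM calls.

```python
from collections import Counter
from typing import Callable, List, Sequence

# Hypothetical type: a corrector maps one N-best list to one corrected transcript.
# In a real setup each corrector would wrap a prompt to a different LLM.
Corrector = Callable[[Sequence[str]], str]


def multi_pass_correct(
    nbest_lists: Sequence[Sequence[str]],
    correctors: Sequence[Corrector],
) -> str:
    """Run every corrector on every system's N-best list, then merge the outputs.

    Assumption: the merge is a plain majority vote over the corrected
    transcripts; ties go to the candidate that was produced first.
    """
    candidates: List[str] = [
        fix(nbest) for nbest in nbest_lists for fix in correctors
    ]
    counts = Counter(candidates)
    best = max(counts.values())
    return next(c for c in candidates if counts[c] == best)


# Toy stand-ins for LLM correctors (assumptions, for illustration only).
def top1(nbest: Sequence[str]) -> str:
    """Return the first-ranked hypothesis unchanged."""
    return nbest[0]


def most_frequent(nbest: Sequence[str]) -> str:
    """Return the hypothesis that occurs most often in the N-best list."""
    return Counter(nbest).most_common(1)[0][0]


# N-best lists from two hypothetical ASR systems for the same utterance.
system_a = ["今日は良い天気です", "今日は良い電気です", "今日は良い天気です"]
system_b = ["今日は良い電気です", "今日は良い天気です", "今日は良い天気です"]

merged = multi_pass_correct([system_a, system_b], [top1, most_frequent])
print(merged)
```

With these toy inputs, three of the four corrected candidates agree, so the vote selects that transcript. A real pipeline would likely need a softer merge (for example, character-level alignment), since independently prompted LLMs rarely produce identical strings.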

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing

Methods: Graph Convolutional Network · Gait Emotion Recognition