Large Language Model Based Generative Error Correction: A Challenge and Baselines for Speech Recognition, Speaker Tagging, and Emotion Recognition

Chao-Han Huck Yang; Taejin Park; Yuan Gong; Yuanchao Li; Zhehuai Chen; Yen-Ting Lin; Chen Chen; Yuchen Hu; Kunal Dhawan; Piotr Żelasko; Chao Zhang; Yun-Nung Chen; Yu Tsao; Jagadeesh Balam; Boris Ginsburg; Sabato Marco Siniscalchi; Eng Siong Chng; Peter Bell; Catherine Lai; Shinji Watanabe; Andreas Stolcke

arXiv:2409.09785·cs.CL·December 2, 2025

Open Access

TL;DR

This paper introduces the GenSEC challenge, which evaluates how well large language models can improve three post-ASR tasks (transcription correction, speaker tagging, and emotion recognition) using open pretrained language models or agent-based APIs.

Contribution

It presents a new challenge for assessing LLMs in speech processing tasks and provides baseline evaluations and insights for future research.

Findings

01. Baseline models show potential in error correction and speaker tagging.

02. Open pretrained LLMs can be adapted for speech-related tasks.

03. Lessons learned inform future evaluation designs.

Abstract

Given recent advances in generative AI technology, a key question is how large language models (LLMs) can enhance acoustic modeling tasks using text decoding results from a frozen, pretrained automatic speech recognition (ASR) model. To explore new capabilities in language modeling for speech processing, we introduce the generative speech transcription error correction (GenSEC) challenge. This challenge comprises three post-ASR language modeling tasks: (i) post-ASR transcription correction, (ii) speaker tagging, and (iii) emotion recognition. These tasks aim to emulate future LLM-based agents handling voice-based interfaces while remaining accessible to a broad audience by utilizing open pretrained language models or agent-based APIs. We also discuss insights from baseline evaluations, as well as lessons learned for designing future evaluations.

Taxonomy

Topics: Natural Language Processing Techniques · Speech Recognition and Synthesis