Exploring the Integration of Large Language Models into Automatic Speech Recognition Systems: An Empirical Study
Zeping Min, Jinbo Wang

TL;DR
This study investigates integrating Large Language Models into ASR systems to improve transcription accuracy, but finds current LLM capabilities insufficient for effective error correction in speech recognition tasks.
Contribution
The paper provides an empirical evaluation of LLMs in ASR, highlighting the challenges and limitations of using in-context learning for speech transcription error correction.
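A minimal sketch of what in-context learning for transcription correction can look like: a few (ASR output, reference) pairs are placed in the prompt, followed by the transcript to be corrected. This summary does not reproduce the prompts used in the paper, so the instruction wording, the field labels, and the function name `build_correction_prompt` are all hypothetical and serve only as illustration.

```python
from typing import List, Tuple


def build_correction_prompt(examples: List[Tuple[str, str]], hypothesis: str) -> str:
    """Assemble a few-shot prompt for ASR error correction.

    `examples` holds (asr_output, reference) pairs used as in-context
    demonstrations; `hypothesis` is the ASR transcript to be corrected.
    The wording below is an assumption, not the prompt from the paper.
    """
    lines = ["Correct the errors in the following speech recognition transcripts."]
    for asr_output, reference in examples:
        lines.append(f"ASR transcript: {asr_output}")
        lines.append(f"Corrected transcript: {reference}")
    # The final block is left open so the model completes the correction.
    lines.append(f"ASR transcript: {hypothesis}")
    lines.append("Corrected transcript:")
    return "\n".join(lines)


if __name__ == "__main__":
    # Made-up demonstration pair, not drawn from Aishell-1 or LibriSpeech.
    demo = [("i red the book yesterday", "i read the book yesterday")]
    print(build_correction_prompt(demo, "she past the exam"))
```

The resulting string would then be sent to the LLM; the API call is omitted here because it depends on the provider and model version.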
Findings
LLMs did not significantly improve ASR accuracy.
LLM-corrected transcriptions often had a higher word error rate (WER) than the original ASR output.
Current LLMs face limitations in speech-related applications.
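Word error rate, the metric behind these findings, is the word-level edit distance between a hypothesis and the reference transcript, divided by the number of reference words. The sketch below is a plain-Python implementation of that standard definition, not the paper's evaluation code; the function names and the example sentences are made up for illustration. For Chinese data such as Aishell-1, the same distance is commonly computed over characters rather than words, which `edit_distance` supports if given lists of characters.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences."""
    # prev[j] is the distance between the reference prefix seen so far
    # and the first j hypothesis tokens.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref_tokens = reference.split()
    if not ref_tokens:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_tokens, hypothesis.split()) / len(ref_tokens)


if __name__ == "__main__":
    # Made-up example: the speaker used non-standard grammar, the ASR output
    # matched it exactly, and a fluency-driven rewrite moved away from it.
    reference = "he don't know nothing"
    asr_output = "he don't know nothing"
    llm_output = "he doesn't know anything"
    print(wer(reference, asr_output))  # 0.0
    print(wer(reference, llm_output))  # 0.5
```

The example shows one way a correction step can raise WER: the metric rewards fidelity to what was said, so a rewrite that reads more fluently still counts as two substitutions.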
Abstract
This paper explores the integration of Large Language Models (LLMs) into Automatic Speech Recognition (ASR) systems to improve transcription accuracy. The increasing sophistication of LLMs, with their in-context learning capabilities and instruction-following behavior, has drawn significant attention in the field of Natural Language Processing (NLP). Our primary focus is to investigate the potential of using an LLM's in-context learning capabilities to enhance the performance of ASR systems, which currently face challenges such as ambient noise, speaker accents, and complex linguistic contexts. We designed a study using the Aishell-1 and LibriSpeech datasets, with ChatGPT and GPT-4 serving as benchmarks for LLM capabilities. Unfortunately, our initial experiments did not yield promising results, indicating the complexity of leveraging LLMs' in-context learning for ASR applications.…
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Topic Modeling
Methods: Attention Is All You Need · Label Smoothing · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Absolute Position Encodings · Linear Layer · Residual Connection · Adam · Dense Connections · Dropout
