Handling Numeric Expressions in Automatic Speech Recognition
Christian Huber, Alexander Waibel

TL;DR
This paper explores methods for accurately recognizing and formatting numeric expressions in speech transcripts, comparing cascaded and end-to-end models, and leveraging large language models for data generation.
Contribution
The paper introduces a data generation strategy that uses an LLM together with a TTS model to produce adaptation data, so that end-to-end ASR models learn to format numeric expressions correctly.
Findings
End-to-end models adapted with LLM- and TTS-generated data perform competitively with LLM-based approaches.
Adapted end-to-end models have lower latency and inference cost.
LLM-based approaches perform well at recognizing formatted numeric expressions.
Abstract
This paper addresses the problem of correctly formatting numeric expressions in automatic speech recognition (ASR) transcripts. This is challenging since the expected transcript format depends on the context, e.g., 1945 (year) vs. 19:45 (timestamp). We compare cascaded and end-to-end approaches to recognize and format numeric expressions such as years, timestamps, currency amounts, and quantities. For the end-to-end approach, we employed a data generation strategy using a large language model (LLM) together with a text-to-speech (TTS) model to generate adaptation data. The results on our test data set show that while approaches based on LLMs perform well in recognizing formatted numeric expressions, adapted end-to-end models offer competitive performance with the advantage of lower latency and inference cost.
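The context dependence described in the abstract can be illustrated with a small sketch. It is not the paper's method: the function name `format_number_pair` and the `kind` labels are hypothetical, and the example assumes the spoken phrase "nineteen forty-five" has already been parsed into the two integers 19 and 45. It only shows that the same spoken numbers map to different written forms depending on the category chosen from context.

```python
# Illustrative sketch only; names and labels are hypothetical,
# not taken from the paper.

def format_number_pair(first: int, second: int, kind: str) -> str:
    """Format a spoken number pair such as "nineteen forty-five".

    `kind` is an assumed context label: "year" or "time".
    """
    if kind == "year":
        # "nineteen forty-five" as a year: digits are concatenated.
        return f"{first}{second:02d}"
    if kind == "time":
        # The same words as a timestamp: hours and minutes with a colon.
        return f"{first:02d}:{second:02d}"
    raise ValueError(f"unknown kind: {kind}")


year_form = format_number_pair(19, 45, "year")
time_form = format_number_pair(19, 45, "time")
print(year_form)  # 1945
print(time_form)  # 19:45
```

The hard part, which this sketch leaves out, is deciding the category from the surrounding words; that decision is what the compared cascaded and end-to-end systems have to make.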
Taxonomy
Topics: Speech Recognition and Synthesis
Methods: Sparse Evolutionary Training
