Handling Numeric Expressions in Automatic Speech Recognition

Christian Huber; Alexander Waibel

arXiv:2408.00004 · eess.AS · June 24, 2025

TL;DR

This paper explores methods for accurately recognizing and formatting numeric expressions in speech transcripts, comparing cascaded and end-to-end models, and leveraging large language models for data generation.

Contribution

The paper introduces a data generation strategy that uses an LLM together with a TTS model to produce adaptation data for end-to-end models, improving numeric expression formatting in ASR.

Findings

01 End-to-end models adapted with LLM- and TTS-generated data perform competitively with LLM-based approaches.

02 Adapted end-to-end models offer lower latency and inference cost.

03 LLM-based approaches perform well at recognizing formatted numeric expressions.

Abstract

This paper addresses the problem of correctly formatting numeric expressions in automatic speech recognition (ASR) transcripts. This is challenging since the expected transcript format depends on the context, e.g., 1945 (year) vs. 19:45 (timestamp). We compare cascaded and end-to-end approaches to recognize and format numeric expressions such as years, timestamps, currency amounts, and quantities. For the end-to-end approach, we employ a data generation strategy using a large language model (LLM) together with a text-to-speech (TTS) model to generate adaptation data. The results on our test data set show that while approaches based on LLMs perform well in recognizing formatted numeric expressions, adapted end-to-end models offer competitive performance with the advantage of lower latency and inference cost.
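To make the context dependence concrete, here is a minimal Python sketch in which the same spoken words, "nineteen forty five", are rendered as 1945 or 19:45 depending on a context label. It is a toy rule-based illustration only, not the cascaded or end-to-end systems the paper compares. The function names, the `kind` label, and the assumption that the utterance consists of exactly two number groups below 100 are all hypothetical choices made for this example.

```python
# Toy illustration: the context label `kind` is given here, whereas a real
# system has to infer it from the surrounding speech.
UNITS = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
    "eleven": 11, "twelve": 12, "thirteen": 13, "fourteen": 14,
    "fifteen": 15, "sixteen": 16, "seventeen": 17, "eighteen": 18,
    "nineteen": 19,
}
TENS = {
    "twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
    "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90,
}


def words_to_number(words):
    """Sum number words for a value below 100, e.g. ['forty', 'five'] -> 45."""
    total = 0
    for word in words:
        if word in TENS:
            total += TENS[word]
        elif word in UNITS:
            total += UNITS[word]
        else:
            raise ValueError(f"unsupported number word: {word!r}")
    return total


def split_groups(spoken):
    """Split an utterance into two number groups, e.g. (19, 45)."""
    words = spoken.lower().split()
    if len(words) < 2:
        raise ValueError("expected two number groups")
    # Assumption: the first group spans two words only when it looks like
    # 'twenty one' and at least one more word follows for the second group.
    first_len = 1
    if (len(words) >= 3 and words[0] in TENS
            and words[1] in UNITS and 0 < UNITS[words[1]] < 10):
        first_len = 2
    return words_to_number(words[:first_len]), words_to_number(words[first_len:])


def format_numeric(spoken, kind):
    """Format the spoken number pair as a year or a timestamp."""
    high, low = split_groups(spoken)
    if kind == "year":
        return f"{high}{low:02d}"
    if kind == "time":
        return f"{high}:{low:02d}"
    raise ValueError(f"unknown kind: {kind!r}")


print(format_numeric("nineteen forty five", "year"))  # 1945
print(format_numeric("nineteen forty five", "time"))  # 19:45
```

The sketch takes the context label as an input, which is exactly the part that is hard in practice: the identical word sequence maps to different written forms, so a formatter needs information beyond the number words themselves.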

Taxonomy

Topics: Speech Recognition and Synthesis

Methods: Sparse Evolutionary Training