Dynamic Context-Aware Streaming Pretrained Language Model For Inverse Text Normalization

Luong Ho; Khanh Le; Vinh Pham; Bao Nguyen; Tan Tran; Duc Chau

arXiv:2505.24229·cs.CL·June 2, 2025

TL;DR

This paper presents a novel dynamic context-aware streaming pretrained language model for inverse text normalization, improving accuracy and efficiency in converting ASR outputs to well-formatted text, especially in low-resource scenarios.

Contribution

It introduces a new streaming ITN model leveraging pretrained representations and dynamic context adaptation, addressing challenges in streaming accuracy and latency.

Findings

01 Achieves accuracy comparable to non-streaming models.

02 Outperforms existing streaming ITN models on Vietnamese data.

03 Maintains low latency suitable for real-time ASR integration.

Abstract

Inverse Text Normalization (ITN) is crucial for converting spoken Automatic Speech Recognition (ASR) outputs into well-formatted written text, enhancing both readability and usability. Despite its importance, the integration of streaming ITN within streaming ASR remains largely unexplored due to challenges in accuracy, efficiency, and adaptability, particularly in low-resource and limited-context scenarios. In this paper, we introduce a streaming pretrained language model for ITN, leveraging pretrained linguistic representations for improved robustness. To address streaming constraints, we propose Dynamic Context-Aware during training and inference, enabling adaptive chunk size adjustments and the integration of right-context information. Experimental results demonstrate that our method achieves accuracy comparable to non-streaming ITN and surpasses existing streaming ITN models on a…
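The abstract's idea of variable chunk sizes combined with right-context can be illustrated with a small sketch. This is not the authors' implementation: the function name `stream_chunks`, its parameters, and the random choice of chunk size are all assumptions made for illustration. It only shows how a token stream could be cut into chunks of varying length, each paired with a few lookahead tokens that a model could read but would not yet emit.

```python
import random
from typing import List, Sequence, Tuple


def stream_chunks(
    tokens: Sequence[str],
    chunk_sizes: Sequence[int],
    right_context: int,
    seed: int = 0,
) -> List[Tuple[List[str], List[str]]]:
    """Split tokens into variable-size chunks, each with lookahead tokens.

    Hypothetical helper: chunk sizes are sampled at random here, which is
    only one possible way to vary them; the paper's actual policy is not
    described in this summary.
    """
    if not chunk_sizes or any(size <= 0 for size in chunk_sizes):
        raise ValueError("chunk_sizes must contain positive integers")
    if right_context < 0:
        raise ValueError("right_context must be non-negative")

    rng = random.Random(seed)
    result: List[Tuple[List[str], List[str]]] = []
    start = 0
    while start < len(tokens):
        size = rng.choice(list(chunk_sizes))
        end = min(start + size, len(tokens))
        # Tokens the model would normalize in this step.
        chunk = list(tokens[start:end])
        # Tokens visible as right-context only; they are emitted later.
        lookahead = list(tokens[end:end + right_context])
        result.append((chunk, lookahead))
        start = end
    return result


if __name__ == "__main__":
    words = "call me at five five five one two three four".split()
    for chunk, lookahead in stream_chunks(words, [2, 3, 4], right_context=2):
        print(chunk, "| right-context:", lookahead)
```

In this sketch a larger `right_context` gives each chunk more future tokens to disambiguate spans such as phone numbers, at the cost of waiting for those tokens to arrive.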

Taxonomy

Topics: Speech Recognition and Synthesis