Word Level Timestamp Generation for Automatic Speech Recognition and Translation

Ke Hu; Krishna Puvvada; Elena Rastorgueva; Zhehuai Chen; He Huang; Shuoyang Ding; Kunal Dhawan; Hainan Xu; Jagadeesh Balam; Boris Ginsburg

arXiv:2505.15646·cs.CL·May 22, 2025

TL;DR

This paper presents a data-driven method for predicting word-level timestamps directly within an end-to-end speech recognition model (Canary), so that timestamps for downstream tasks such as speech content retrieval and timed subtitles are produced without a separate alignment module.

Contribution

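The authors introduce a new <|timestamp|> token that lets the Canary model predict the start and end timestamp of each word directly. The model is trained on word-level timestamps generated by the NeMo Forced Aligner (NFA), which serves as the teacher, and the approach is also extended to speech translation.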

Findings

01 Precision and recall of the predicted word timestamps fall between 80% and 90%.

02 Timestamp prediction errors range from 20 to 120 ms across four languages.

03 Word error rate (WER) degrades only minimally.

Abstract

We introduce a data-driven approach for enabling word-level timestamp prediction in the Canary model. Accurate timestamp information is crucial for a variety of downstream tasks such as speech content retrieval and timed subtitles. While traditional hybrid systems and end-to-end (E2E) models may employ external modules for timestamp prediction, our approach eliminates the need for separate alignment mechanisms. By leveraging the NeMo Forced Aligner (NFA) as a teacher model, we generate word-level timestamps and train the Canary model to predict timestamps directly. We introduce a new <|timestamp|> token, enabling the Canary model to predict start and end timestamps for each word. Our method demonstrates precision and recall rates between 80% and 90%, with timestamp prediction errors ranging from 20 to 120 ms across four languages, with minimal WER degradation. Additionally, we extend…
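The precision, recall, and millisecond error figures quoted in the abstract can be made concrete with a small scoring sketch. The listing does not give the paper's exact evaluation protocol, so everything below is an assumption for illustration: the `WordStamp` type, the `score_timestamps` function, the greedy same-word matching, and the 200 ms `collar` tolerance are hypothetical and are not taken from the paper or from NeMo.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class WordStamp:
    """A word with start and end times in seconds (hypothetical type)."""
    word: str
    start: float
    end: float


def score_timestamps(pred: List[WordStamp], ref: List[WordStamp],
                     collar: float = 0.2) -> Dict[str, float]:
    """Score predicted word timestamps against reference timestamps.

    Assumption: a prediction is a hit when an unused reference word has the
    same text and both its start and end lie within `collar` seconds.
    """
    used = set()                 # indices of reference words already matched
    start_errs: List[float] = []
    end_errs: List[float] = []
    for p in pred:
        for i, r in enumerate(ref):
            if i in used or r.word != p.word:
                continue
            ds = abs(p.start - r.start)
            de = abs(p.end - r.end)
            if ds <= collar and de <= collar:
                used.add(i)
                start_errs.append(ds)
                end_errs.append(de)
                break            # each prediction matches at most one reference
    hits = len(used)
    return {
        "precision": hits / len(pred) if pred else 0.0,
        "recall": hits / len(ref) if ref else 0.0,
        # Mean absolute boundary errors over matched words, in milliseconds.
        "mean_start_err_ms": 1000 * sum(start_errs) / hits if hits else 0.0,
        "mean_end_err_ms": 1000 * sum(end_errs) / hits if hits else 0.0,
    }


if __name__ == "__main__":
    reference = [WordStamp("hello", 0.0, 0.4), WordStamp("world", 0.5, 0.9)]
    predicted = [WordStamp("hello", 0.02, 0.42), WordStamp("world", 0.55, 0.93)]
    print(score_timestamps(predicted, reference))
```

Under this sketch, precision is the share of predicted words that land within the collar of a reference word, recall is the share of reference words that are recovered, and the millisecond errors are averaged over matched words only.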

Code & Models

Repositories

NVIDIA/NeMo
PyTorch · Official

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Speech and dialogue systems