Achieving Timestamp Prediction While Recognizing with Non-Autoregressive End-to-End ASR Model
Xian Shi, Yanni Chen, Shiliang Zhang, and Zhijie Yan

TL;DR
This paper introduces a method for timestamp prediction within end-to-end non-autoregressive ASR models using the continuous integrate-and-fire (CIF) mechanism, improving timestamp accuracy and integrating the recognition and alignment tasks.
Contribution
It proposes a scaled-CIF mechanism and post-processing strategies to enhance timestamp prediction in end-to-end ASR, addressing limitations of traditional force-alignment.
Findings
Significant reduction in the AAS and DER metrics with the proposed methods
Improved timestamp accuracy over conventional force-alignment
Effective integration of timestamp prediction in end-to-end ASR
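To make the AAS metric in the findings concrete, here is a minimal sketch of one plausible reading of "accumulated averaging shift": the mean absolute shift between predicted and reference token boundaries. The exact definition is not given on this page, so the formula, the function name, and the millisecond units are all assumptions for illustration.

```python
def accumulated_averaging_shift(pred, ref):
    """Mean absolute boundary shift between two timestamp sequences.

    ASSUMPTION: AAS is read here as the average of |start shift| and
    |end shift| over all tokens; the paper's exact formula may differ.
    pred and ref are lists of (start_ms, end_ms) tuples, one per token.
    """
    if len(pred) != len(ref):
        raise ValueError("pred and ref must contain the same number of tokens")
    if not pred:
        raise ValueError("at least one token is required")
    total = 0
    for (p_start, p_end), (r_start, r_end) in zip(pred, ref):
        total += abs(p_start - r_start) + abs(p_end - r_end)
    # Two boundaries per token, so divide by 2 * number of tokens.
    return total / (2 * len(pred))


# Hypothetical predicted and manually marked timestamps, in milliseconds.
predicted = [(0, 180), (180, 300), (300, 480)]
reference = [(0, 200), (200, 300), (300, 500)]
aas_ms = accumulated_averaging_shift(predicted, reference)
```

A lower value means the predicted boundaries sit closer to the manually marked ones.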
Abstract
Conventional ASR systems use frame-level phoneme posteriors to perform force-alignment (FA) and provide timestamps, whereas end-to-end ASR systems, especially AED-based ones, lack this ability. This paper proposes performing timestamp prediction (TP) during recognition by exploiting the continuous integrate-and-fire (CIF) mechanism in the non-autoregressive ASR model Paraformer. Focusing on the fire-place bias issue of CIF, we apply post-processing strategies including fire-delay and silence insertion. In addition, we propose scaled-CIF to smooth the weights of the CIF output, which proves beneficial for both the ASR and TP tasks. Accumulated averaging shift (AAS) and diarization error rate (DER) are adopted to measure timestamp quality, and we compare these metrics for the proposed system and a conventional hybrid force-alignment system. The experiment results over manually-marked timestamps…
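The way a CIF-style module can yield token timestamps may be easier to see in a small sketch. The code below shows only the basic integrate-and-fire idea: per-frame weights are accumulated, and a token boundary fires whenever the running sum reaches a threshold. The threshold of 1.0, the 60 ms frame shift, and the function names are assumptions for illustration; the paper's scaled-CIF, fire-delay, and silence-insertion steps are not modeled here.

```python
def cif_fire_frames(alphas, threshold=1.0):
    """Return the frame indices at which the accumulated weight fires.

    ASSUMPTION: each weight lies in [0, 1], so at most one fire per frame.
    """
    fires = []
    acc = 0.0
    for t, alpha in enumerate(alphas):
        acc += alpha
        if acc >= threshold:
            fires.append(t)
            acc -= threshold  # carry the remainder into the next token
    return fires


def fires_to_segments(fires, frame_shift_ms=60):
    """Turn fire frames into (start_ms, end_ms) token segments.

    ASSUMPTION: a token ends at the end of its fire frame, and the next
    token starts where the previous one ended.
    """
    segments = []
    start = 0
    for t in fires:
        end = (t + 1) * frame_shift_ms
        segments.append((start, end))
        start = end
    return segments


# Hypothetical per-frame weights for a three-token utterance.
alphas = [0.25, 0.5, 0.25, 0.5, 0.5, 0.0, 0.75, 0.25]
fires = cif_fire_frames(alphas)
segments = fires_to_segments(fires)
```

In this naive form every token is assigned a contiguous span with no gaps, which is one reason post-processing such as silence insertion is relevant.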
Taxonomy
Topics: Speech and dialogue systems · EEG and Brain-Computer Interfaces · Gaze Tracking and Assistive Technology
