Less Peaky and More Accurate CTC Forced Alignment by Label Priors

Ruizhe Huang; Xiaohui Zhang; Zhaoheng Ni; Li Sun; Moto Hira; Jeff Hwang; Vimal Manohar; Vineel Pratap; Matthew Wiesner; Shinji Watanabe; Daniel Povey; Sanjeev Khudanpur

arXiv:2406.02560·eess.AS·July 22, 2024

TL;DR

This paper introduces a modified CTC model that reduces peaky output distributions by leveraging label priors, leading to more accurate forced alignments at phoneme and word levels, with improved efficiency and comparable performance to existing tools.

Contribution

The paper proposes using label priors to mitigate the peaky behavior of CTC, improving forced-alignment accuracy while simplifying training.

Findings

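01 Reduces phoneme and word boundary errors (PBE and WBE) by 12-40% relative to the standard CTC model and a heuristics-based offset method, measured on Buckeye and TIMIT.

02 Produces less peaky posteriors and predicts token offsets, not only onsets, more accurately.

03 Offers a simpler, more efficient training pipeline.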
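To make the onset/offset point concrete, here is a minimal sketch of how token spans can be read off a frame-level CTC alignment. The function name `token_spans`, the blank symbol `_`, and both frame sequences are hypothetical illustrations, not taken from the paper or its code.

```python
BLANK = "_"  # assumed blank symbol for this illustration


def token_spans(frame_labels, blank=BLANK):
    """Collapse a frame-level CTC alignment into (token, onset, offset) spans.

    onset is the first frame of a run of identical non-blank labels;
    offset is one past the last frame of that run (exclusive).
    """
    spans = []
    prev = blank
    for t, label in enumerate(frame_labels):
        if label != blank and label != prev:
            spans.append([label, t, t + 1])  # a new token starts here
        elif label != blank:
            spans[-1][2] = t + 1  # the current token continues
        prev = label
    return [tuple(s) for s in spans]


# Hypothetical peaky alignment: each token fires for one frame, then blanks.
peaky = ["_", "k", "_", "_", "ae", "_", "_", "t", "_"]
# Hypothetical less peaky alignment: each token is held for its duration.
dense = ["_", "k", "k", "k", "ae", "ae", "ae", "t", "t"]

peaky_spans = token_spans(peaky)
dense_spans = token_spans(dense)
```

In the peaky case every span is one frame long, so the offset carries no information about where the token actually ends and has to be guessed by a heuristic. In the less peaky case the offsets come directly from the alignment.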

Abstract

Connectionist temporal classification (CTC) models are known to have peaky output distributions. Such behavior is not a problem for automatic speech recognition (ASR), but it can cause inaccurate forced alignments (FA), especially at finer granularity, e.g., phoneme level. This paper aims to alleviate the peaky behavior of CTC and improve its suitability for forced alignment generation by leveraging label priors, so that the scores of alignment paths containing fewer blanks are boosted and maximized during training. As a result, our CTC model produces less peaky posteriors and is able to more accurately predict the offsets of tokens in addition to their onsets. It outperforms the standard CTC model and a heuristics-based approach for obtaining CTC's token offset timestamps by 12-40% in phoneme and word boundary errors (PBE and WBE) measured on the Buckeye and TIMIT data. Compared with…
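The abstract does not spell out how the priors enter the training objective, so the following is only a sketch under an assumption: each frame's log posterior is reduced by a scaled log prior of its label before path scores are summed. The scaling factor `alpha`, the helper `path_score`, and all probabilities below are hypothetical. The sketch shows why such a correction favors paths with fewer blanks when the blank label has a large prior.

```python
import math

BLANK = "_"  # assumed blank symbol for this illustration


def path_score(log_post, path, log_prior=None, alpha=0.0):
    """Sum per-frame log scores along one alignment path.

    With a prior, each frame contributes
    log p(label | frame) - alpha * log prior(label).
    """
    total = 0.0
    for frame, label in zip(log_post, path):
        score = frame[label]
        if log_prior is not None:
            score -= alpha * log_prior[label]
        total += score
    return total


# Hypothetical posteriors for 4 frames over the labels {blank, "a"}.
posteriors = [
    {BLANK: 0.3, "a": 0.7},
    {BLANK: 0.6, "a": 0.4},
    {BLANK: 0.6, "a": 0.4},
    {BLANK: 0.7, "a": 0.3},
]
log_post = [{k: math.log(v) for k, v in f.items()} for f in posteriors]

# Hypothetical label priors: blank dominates, as it would in a peaky model.
log_prior = {BLANK: math.log(0.8), "a": math.log(0.2)}

# Both paths collapse to the same token sequence "a".
peaky_path = ["a", BLANK, BLANK, BLANK]  # one spike, then blanks
dense_path = ["a", "a", "a", BLANK]      # token held across frames

# Score advantage of the dense path over the peaky path.
plain_gap = path_score(log_post, dense_path) - path_score(log_post, peaky_path)
prior_gap = (path_score(log_post, dense_path, log_prior, alpha=1.0)
             - path_score(log_post, peaky_path, log_prior, alpha=1.0))
```

Without the prior the peaky path scores higher (`plain_gap` is negative). With the prior applied the ordering flips (`prior_gap` is positive), because each blank frame no longer benefits from the blank label's high overall frequency.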

Code & Models

Repositories

huangruizhe/audio
pytorch (Official)

Taxonomy

Topics: Handwritten Text Recognition Techniques · Advanced Numerical Analysis Techniques

Methods: Feedback Alignment