LCS-CTC: Leveraging Soft Alignments to Enhance Phonetic Transcription Robustness

Zongli Ye; Jiachen Lian; Akshaj Gupta; Xuanru Zhou; Haodong Li; Krish Patel; Hwi Joo Park; Dingkun Zhou; Chenxu Guo; Shuhe Li; Sam Wang; Iris Zhou; Cheol Jun Cho; Zoe Ezzes; Jet M.J. Vonk; Brittany T. Morin; Rian Bogley; Lisa Wauters; Zachary A. Miller; Maria Luisa Gorno-Tempini; Gopala Anumanchipalli

arXiv:2508.03937·eess.AS·August 15, 2025

TL;DR

LCS-CTC is a two-stage phoneme recognition framework that uses soft frame-phoneme alignments to constrain CTC training, improving robustness and generalization in speech transcription, especially for unclear and nonfluent speech.

Contribution

The paper proposes LCS-CTC, which combines a similarity-aware local alignment algorithm with a constrained CTC training objective to improve recognition accuracy and alignment quality.

Findings

01 Outperforms vanilla CTC on LibriSpeech and PPA datasets.

02 Improves robustness in recognizing nonfluent and unclear speech.

03 Enables text-free forced alignment with higher confidence.

Abstract

Phonetic speech transcription is crucial for fine-grained linguistic analysis and downstream speech applications. Connectionist Temporal Classification (CTC) is widely used for such tasks because of its efficiency, but it often falls short in recognition performance, especially on unclear and nonfluent speech. In this work, we propose LCS-CTC, a two-stage framework for phoneme-level speech recognition that combines a similarity-aware local alignment algorithm with a constrained CTC training objective. The method predicts fine-grained frame-phoneme cost matrices and applies a modified Longest Common Subsequence (LCS) algorithm to identify high-confidence alignment zones. These zones constrain the CTC decoding path space, which reduces overfitting and improves generalization, enabling both robust recognition and text-free forced alignment.…
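To make the alignment idea in the abstract more concrete, here is a minimal sketch of a weighted LCS over a frame-by-phoneme similarity matrix. It is not the paper's modified LCS algorithm, whose details are not given on this page. The function name `soft_lcs`, the `threshold` parameter, the use of similarities rather than costs, and the toy matrix are all assumptions made for illustration. A frame-phoneme pair counts as a match only when its similarity reaches the threshold, and the dynamic program keeps the monotonic set of matches with the highest total similarity.

```python
def soft_lcs(sim, threshold=0.5):
    """Weighted LCS over a similarity matrix (illustrative, hypothetical helper).

    sim[t][k] is a similarity in [0, 1] between frame t and phoneme k.
    Returns (total_score, pairs), where pairs is a monotonic list of
    (frame_index, phoneme_index) matches with similarity >= threshold.
    """
    n_frames = len(sim)
    n_phones = len(sim[0]) if sim else 0

    # dp[t][k] = best total similarity using the first t frames and k phonemes.
    dp = [[0.0] * (n_phones + 1) for _ in range(n_frames + 1)]
    for t in range(1, n_frames + 1):
        for k in range(1, n_phones + 1):
            best = max(dp[t - 1][k], dp[t][k - 1])  # skip a frame or a phoneme
            s = sim[t - 1][k - 1]
            if s >= threshold:  # only confident pairs may be matched
                best = max(best, dp[t - 1][k - 1] + s)
            dp[t][k] = best

    # Backtrace from the bottom-right corner to recover the matched pairs.
    pairs = []
    t, k = n_frames, n_phones
    while t > 0 and k > 0:
        s = sim[t - 1][k - 1]
        if s >= threshold and dp[t][k] == dp[t - 1][k - 1] + s:
            pairs.append((t - 1, k - 1))
            t -= 1
            k -= 1
        elif dp[t][k] == dp[t - 1][k]:
            t -= 1
        else:
            k -= 1
    pairs.reverse()
    return dp[n_frames][n_phones], pairs


# Toy example (assumed values): 4 frames, 3 phonemes.
sim = [
    [0.9, 0.1, 0.0],
    [0.2, 0.8, 0.1],
    [0.1, 0.3, 0.2],  # no confident match for this frame
    [0.0, 0.1, 0.7],
]
score, pairs = soft_lcs(sim)
print(pairs)  # [(0, 0), (1, 1), (3, 2)]
```

In this toy case frame 2 stays unmatched because none of its similarities reach the threshold. In the spirit of the abstract, the matched pairs play the role of high-confidence zones, and a constrained CTC objective could then restrict its paths to agree with them while leaving the unmatched frame free.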
