Towards trustworthy phoneme boundary detection with autoregressive model and improved evaluation metric
Hyeongju Kim, Hyeong-Seok Choi

TL;DR
This paper introduces SuperSeg, an autoregressive model for phoneme boundary detection, and proposes improved evaluation metrics to more accurately assess boundary detection performance.
Contribution
It presents a novel autoregressive boundary detector and new evaluation metrics that address limitations of existing measures, improving reliability in phoneme boundary detection assessment.
Findings
SuperSeg outperforms existing models on the TIMIT and Buckeye corpora.
The new metrics prevent any single boundary from contributing to the score more than once, giving a more reliable evaluation.
The autoregressive approach improves phoneme boundary detection accuracy.
Abstract
Phoneme boundary detection has been studied due to its central role in various speech applications. In this work, we point out that this task needs to be addressed not only through better algorithms but also through better evaluation metrics. To this end, we first propose a state-of-the-art phoneme boundary detector that operates in an autoregressive manner, dubbed SuperSeg. Experiments on the TIMIT and Buckeye corpora demonstrate that SuperSeg identifies phoneme boundaries more accurately than existing models by a significant margin. Furthermore, we note that the popular evaluation metric, R-value, has a limitation, and we propose new evaluation metrics that prevent each boundary from contributing to the evaluation multiple times. The proposed metrics reveal the weaknesses of non-autoregressive baselines and establish a reliable criterion suited to evaluating phoneme boundary detection.
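To make the evaluation concern concrete, the following is a minimal sketch of how a boundary can end up counted more than once, and how a one-to-one matching avoids it. It is an illustration under stated assumptions, not the paper's exact metric definitions: the function names (lenient_hits, one_to_one_hits, r_value), the greedy matching strategy, and the 20 ms tolerance are all hypothetical choices made for this example. The R-value helper follows the commonly used definition in terms of hit rate and over-segmentation, both expressed as fractions.

```python
import math


def lenient_hits(ref, pred, tol):
    """Count hits without enforcing one-to-one matching (illustrative only).

    Returns (reference boundaries with at least one prediction within tol,
             predicted boundaries with at least one reference within tol).
    A single boundary can therefore justify several hits on the other side.
    """
    ref_hits = sum(any(abs(r - p) <= tol for p in pred) for r in ref)
    pred_hits = sum(any(abs(p - r) <= tol for r in ref) for p in pred)
    return ref_hits, pred_hits


def one_to_one_hits(ref, pred, tol):
    """Greedy left-to-right matching: each boundary is used at most once.

    This is an assumed matching rule for illustration; the paper may define
    its matching differently.
    """
    ref, pred = sorted(ref), sorted(pred)
    i = j = hits = 0
    while i < len(ref) and j < len(pred):
        if abs(ref[i] - pred[j]) <= tol:
            hits += 1          # consume both boundaries
            i += 1
            j += 1
        elif pred[j] < ref[i]:
            j += 1             # prediction too early to match anything later
        else:
            i += 1             # reference too early to match anything later
    return hits


def r_value(hit_rate, n_pred, n_ref):
    """R-value from hit rate and over-segmentation, both as fractions."""
    over_seg = n_pred / n_ref - 1.0
    r1 = math.sqrt((1.0 - hit_rate) ** 2 + over_seg ** 2)
    r2 = (-over_seg + hit_rate - 1.0) / math.sqrt(2.0)
    return 1.0 - (abs(r1) + abs(r2)) / 2.0


# Boundary times in integer milliseconds to avoid floating-point edge cases.
ref = [100, 200]
pred = [110, 120, 190]   # two predictions cluster around the first reference
tol = 20

ref_hits, pred_hits = lenient_hits(ref, pred, tol)   # (2, 3)
strict = one_to_one_hits(ref, pred, tol)             # 2

print("lenient precision:", pred_hits / len(pred))   # all 3 predictions count
print("one-to-one precision:", strict / len(pred))   # only 2 of 3 count
print("R-value (one-to-one):", r_value(strict / len(ref), len(pred), len(ref)))
```

In this toy case the reference boundary at 100 ms validates both nearby predictions under the lenient count, so precision looks perfect even though one prediction is spurious. Under the one-to-one count, each reference can be claimed only once and the extra prediction is treated as a false alarm.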
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Speech and Audio Processing
