Back to Supervision: Boosting Word Boundary Detection through Frame Classification
Simone Carnemolla, Salvatore Calcagno, Simone Palazzo, Daniela Giordano

TL;DR
This paper introduces a supervised, model-agnostic framework for word boundary detection in speech, utilizing label augmentation and frame selection, achieving state-of-the-art results on Buckeye and TIMIT datasets.
Contribution
It presents a novel supervised approach that combines label augmentation and output-frame selection with state-of-the-art encoder models, and that outperforms existing supervised and self-supervised methods.
Findings
HuBERT encoder achieves highest performance
State-of-the-art F-values on Buckeye and TIMIT datasets
Robust preprocessing method for audio tokenization
Abstract
Speech segmentation at both word and phoneme levels is crucial for various speech processing tasks. It significantly aids in extracting meaningful units from an utterance, thus enabling the generation of discrete elements. In this work we propose a model-agnostic framework that performs word boundary detection in a supervised manner, employing a label augmentation technique and an output-frame selection strategy. We trained and tested on the Buckeye dataset and used TIMIT for testing only, with state-of-the-art encoder models, including pre-trained solutions (Wav2Vec 2.0 and HuBERT) as well as convolutional and convolutional recurrent networks. Our method, with the HuBERT encoder, surpasses the performance of other state-of-the-art architectures, whether trained in supervised or self-supervised settings on the same datasets. Specifically, we achieved F-values of 0.8427 on the Buckeye…
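To make the two ideas named in the abstract concrete, here is a minimal sketch of how label augmentation and output-frame selection could look in a frame-classification setup. The abstract gives no details of either step, so everything below is an assumption for illustration: the 20 ms frame stride (the stride Wav2Vec 2.0 and HuBERT produce for 16 kHz audio), the `radius` of neighbouring frames marked positive, the 0.5 threshold, and the rule of keeping the best-scoring frame in each run of positive frames. The function names are hypothetical and are not taken from the paper.

```python
from typing import List

FRAME_SEC = 0.02  # assumed frame stride of 20 ms; not stated in this summary


def boundaries_to_frame_labels(boundaries_sec: List[float], n_frames: int,
                               radius: int = 1) -> List[int]:
    """Label augmentation (assumed form): mark each boundary frame and its
    `radius` neighbours on both sides as positive, so positives are less sparse."""
    labels = [0] * n_frames
    for t in boundaries_sec:
        center = int(round(t / FRAME_SEC))
        lo = max(0, center - radius)
        hi = min(n_frames, center + radius + 1)
        for i in range(lo, hi):
            labels[i] = 1
    return labels


def select_boundary_frames(probs: List[float], threshold: float = 0.5) -> List[int]:
    """Frame selection (assumed form): for every run of consecutive frames whose
    score reaches `threshold` (must be > 0), keep only the highest-scoring frame."""
    selected: List[int] = []
    run: List[int] = []
    for i, p in enumerate(probs + [0.0]):  # trailing 0.0 closes a final run
        if p >= threshold:
            run.append(i)
        elif run:
            selected.append(max(run, key=lambda j: probs[j]))
            run = []
    return selected


if __name__ == "__main__":
    # Two word boundaries at 0.10 s and 0.30 s in a 20-frame utterance.
    print(boundaries_to_frame_labels([0.10, 0.30], n_frames=20))
    # Hypothetical per-frame boundary probabilities from a classifier.
    print(select_boundary_frames([0.1, 0.6, 0.9, 0.7, 0.2, 0.1, 0.8, 0.55]))
```

Widening the positive labels during training and then collapsing each run of positive predictions to a single frame at inference are complementary: the first eases the class imbalance of boundary frames, the second avoids reporting one boundary several times.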
Taxonomy
Topics: Natural Language Processing Techniques · Speech Recognition and Synthesis
