On the Difficulty of Segmenting Words with Attention
Ramon Sanabria, Hao Tang, Sharon Goldwater

TL;DR
This paper investigates whether attention mechanisms can be used for word segmentation in speech, and finds them unreliable unless models are trained specifically to predict phones from words, which limits their applicability to new data.
Contribution
The study demonstrates that attention-based segmentation only works well when models are trained to predict phones from words, highlighting its limited applicability in other training scenarios.
Findings
Attention-based segmentation is brittle even on monolingual data.
Models predicting phones from words succeed in segmentation.
Models predicting words from phones perform poorly in segmentation.
Abstract
Word segmentation, the problem of finding word boundaries in speech, is of interest for a range of tasks. Previous papers have suggested that for sequence-to-sequence models trained on tasks such as speech translation or speech recognition, attention can be used to locate and segment the words. We show, however, that even on monolingual data this approach is brittle. In our experiments with different input types, data sizes, and segmentation algorithms, only models trained to predict phones from words succeed in the task. Models trained to predict words from either phones or speech (i.e., the opposite direction, which is the one needed to generalize to new data) yield much worse results, suggesting that attention-based segmentation is only useful in limited scenarios.
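To make the idea of attention-based segmentation concrete, here is a minimal sketch of one simple hard-assignment scheme: each phone is assigned to the word with which it shares the largest attention weight, and a boundary is placed wherever that assignment changes. This is an illustration only, not necessarily one of the segmentation algorithms compared in the paper; the function name `segment_from_attention`, the matrix layout `attn[w][p]`, and the toy weights are all assumptions made for the example.

```python
def segment_from_attention(attn):
    """Place word boundaries from an attention matrix (illustrative sketch).

    attn[w][p] is assumed to hold the attention weight linking word w and
    phone p. This layout is hypothetical; a real model may store the
    transpose, depending on which side is the encoder input.
    Returns the phone indices at which a new word starts.
    """
    if not attn or not attn[0]:
        return []
    n_words = len(attn)
    n_phones = len(attn[0])

    # Hard assignment: each phone goes to the word with the highest weight.
    assignment = [
        max(range(n_words), key=lambda w: attn[w][p])
        for p in range(n_phones)
    ]

    # A boundary falls between two consecutive phones assigned to different words.
    return [
        p + 1
        for p in range(n_phones - 1)
        if assignment[p] != assignment[p + 1]
    ]


# Toy example with made-up weights: 2 words, 5 phones.
toy_attn = [
    [0.8, 0.7, 0.2, 0.1, 0.1],  # word 0 attends mostly to phones 0-1
    [0.2, 0.3, 0.8, 0.9, 0.9],  # word 1 attends mostly to phones 2-4
]
print(segment_from_attention(toy_attn))  # [2]
```

With clean, near-monotonic attention like the toy matrix, the scheme recovers a single boundary before phone 2. When the attention is diffuse or non-monotonic, the same procedure produces spurious or missing boundaries, which is one way the brittleness described in the abstract could show up.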
