Syllable Discovery and Cross-Lingual Generalization in a Visually Grounded, Self-Supervised Speech Model
Puyuan Peng, Shang-Wen Li, Okko Räsänen, Abdelrahman Mohamed, David Harwath

TL;DR
This study shows that syllabic representations emerge in a self-supervised speech model trained with a visually grounded objective. These representations enable zero-shot cross-lingual syllable and word segmentation, and on English the model outperforms a state-of-the-art syllabic segmentation method.
Contribution
The paper shows that a visually grounded training objective fosters syllable discovery in a self-supervised speech model, proposes a minimum cut algorithm followed by 2-stage clustering to segment and group syllables, and demonstrates zero-shot cross-lingual generalization.
Findings
Model captures syllabic units through visual grounding.
Outperforms a state-of-the-art syllabic segmentation method on English.
Achieves zero-shot generalization to Estonian and other languages.
Abstract
In this paper, we show that representations capturing syllabic units emerge when training a self-supervised speech model with a visually-grounded training objective. We demonstrate that a nearly identical model architecture (HuBERT) trained with a masked language modeling loss does not exhibit this same ability, suggesting that the visual grounding objective is responsible for the emergence of this phenomenon. We propose the use of a minimum cut algorithm to automatically predict syllable boundaries in speech, followed by a 2-stage clustering method to group identical syllables together. We show that our model not only outperforms a state-of-the-art syllabic segmentation method on the language it was trained on (English), but also generalizes in a zero-shot fashion to Estonian. Finally, we show that the same model is capable of zero-shot generalization for a word segmentation task on 4…
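The minimum cut step mentioned in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes the cut is computed over a cosine self-similarity matrix of frame-level features, that the number of segments is given in advance, and that a normalized-cut cost is minimized with dynamic programming. The function name `min_cut_segment` and its parameters are hypothetical.

```python
import numpy as np

def min_cut_segment(features, num_segments):
    """Split T frames into contiguous segments by minimizing a
    normalized-cut cost (illustrative sketch, not the paper's code)."""
    feats = np.asarray(features, dtype=float)
    T = feats.shape[0]
    # Cosine self-similarity between all pairs of frames.
    normed = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + 1e-8)
    sim = normed @ normed.T
    sim = sim - sim.min()  # shift so all similarities are non-negative

    # 2D prefix sums so any rectangular block sum costs O(1).
    P = np.zeros((T + 1, T + 1))
    P[1:, 1:] = sim.cumsum(axis=0).cumsum(axis=1)

    def block(a, b, c, d):
        # Sum of sim[a:b, c:d].
        return P[b, d] - P[a, d] - P[b, c] + P[a, c]

    def cost(s, e):
        # Similarity leaving the segment [s, e), relative to its total.
        vol = block(s, e, 0, T)
        within = block(s, e, s, e)
        return (vol - within) / (vol + 1e-8)

    # best[k, e]: minimal cost of covering frames [0, e) with k segments.
    best = np.full((num_segments + 1, T + 1), np.inf)
    back = np.zeros((num_segments + 1, T + 1), dtype=int)
    best[0, 0] = 0.0
    for k in range(1, num_segments + 1):
        for e in range(k, T + 1):
            for s in range(k - 1, e):
                c = best[k - 1, s] + cost(s, e)
                if c < best[k, e]:
                    best[k, e] = c
                    back[k, e] = s

    # Backtrack from the last frame to recover the boundaries.
    bounds = [T]
    e = T
    for k in range(num_segments, 0, -1):
        e = int(back[k, e])
        bounds.append(e)
    return bounds[::-1]

# Toy input: five frames of one "syllable", then five of another.
toy = np.array([[1.0, 0.0]] * 5 + [[0.0, 1.0]] * 5)
boundaries = min_cut_segment(toy, num_segments=2)
```

On the toy input the two groups of frames are orthogonal, so the lowest-cost split falls between them. With real model features, the number of segments would have to be estimated or supplied, and the resulting segments would then be passed to a separate clustering step.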
Taxonomy
Topics: Speech Recognition and Synthesis · Multimodal Machine Learning Applications · Speech and dialogue systems
