Syllable Discovery and Cross-Lingual Generalization in a Visually Grounded, Self-Supervised Speech Model

Puyuan Peng; Shang-Wen Li; Okko Räsänen; Abdelrahman Mohamed; David Harwath

arXiv:2305.11435 · eess.AS · July 25, 2023 · 1 citation

TL;DR

This study shows that syllabic representations emerge in a self-supervised speech model trained with a visually grounded objective. The model outperforms a state-of-the-art syllabic segmentation method on English and generalizes zero-shot to syllable segmentation in Estonian and to word segmentation in other languages.

Contribution

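The paper shows that syllable-level representations emerge when a self-supervised speech model is trained with a visually grounded objective, proposes a minimum cut algorithm followed by 2-stage clustering to segment and group syllables, and demonstrates zero-shot cross-lingual generalization.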

Findings

01

Model captures syllabic units through visual grounding.

02

Outperforms a state-of-the-art syllabic segmentation method on English.

03

Generalizes zero-shot to syllable segmentation in Estonian and to word segmentation in other languages.

Abstract

In this paper, we show that representations capturing syllabic units emerge when training a self-supervised speech model with a visually-grounded training objective. We demonstrate that a nearly identical model architecture (HuBERT) trained with a masked language modeling loss does not exhibit this same ability, suggesting that the visual grounding objective is responsible for the emergence of this phenomenon. We propose the use of a minimum cut algorithm to automatically predict syllable boundaries in speech, followed by a 2-stage clustering method to group identical syllables together. We show that our model not only outperforms a state-of-the-art syllabic segmentation method on the language it was trained on (English), but also generalizes in a zero-shot fashion to Estonian. Finally, we show that the same model is capable of zero-shot generalization for a word segmentation task on 4…
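
The minimum cut step named in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes frame-level features are already available as a NumPy array, that the number of segments is given in advance, and that segments are contiguous in time, and it uses a generic normalized-cut cost solved by dynamic programming. The function name `min_cut_segment` and its parameters are hypothetical.

```python
import numpy as np

def min_cut_segment(features, num_segments):
    """Split T frames into contiguous segments with minimal total normalized cut.

    features: (T, D) array of frame features (assumed input, e.g. model outputs).
    num_segments: number of segments to produce (assumed to be known).
    Returns a list of num_segments + 1 boundary indices, from 0 to T.
    """
    T = features.shape[0]
    if not 1 <= num_segments <= T:
        raise ValueError("num_segments must be between 1 and the number of frames")

    # Frame-by-frame similarity, shifted so every entry is non-negative.
    sim = features @ features.T
    sim = sim - sim.min()

    # 2D prefix sums so the sum of any rectangular block costs O(1).
    prefix = np.zeros((T + 1, T + 1))
    prefix[1:, 1:] = sim.cumsum(axis=0).cumsum(axis=1)

    def block(r0, r1, c0, c1):
        # Sum of sim[r0:r1, c0:c1].
        return prefix[r1, c1] - prefix[r0, c1] - prefix[r1, c0] + prefix[r0, c0]

    def cost(i, j):
        # Normalized cut of segment [i, j): similarity leaving the segment
        # divided by all similarity attached to the segment.
        volume = block(i, j, 0, T)
        cut = volume - block(i, j, i, j)
        return cut / volume if volume > 0 else 0.0

    # best[k, j]: minimal cost of covering frames [0, j) with k segments.
    best = np.full((num_segments + 1, T + 1), np.inf)
    back = np.zeros((num_segments + 1, T + 1), dtype=int)
    best[0, 0] = 0.0
    for k in range(1, num_segments + 1):
        for j in range(k, T + 1):
            for i in range(k - 1, j):
                candidate = best[k - 1, i] + cost(i, j)
                if candidate < best[k, j]:
                    best[k, j] = candidate
                    back[k, j] = i

    # Walk the back-pointers from the last frame to recover the boundaries.
    bounds = [T]
    j = T
    for k in range(num_segments, 0, -1):
        j = int(back[k, j])
        bounds.append(j)
    return bounds[::-1]

# Toy input: three runs of identical one-hot frames with lengths 4, 3, and 5.
toy = np.repeat(np.eye(3), [4, 3, 5], axis=0)
print(min_cut_segment(toy, 3))  # [0, 4, 7, 12]
```

On the toy input, frames within a run are identical and frames from different runs are orthogonal, so the only zero-cost partition is the one that cuts exactly at the run boundaries. The triple loop is cubic in the number of frames, which is acceptable for a sketch on short utterances but would need restricting (for example, a maximum segment length) for longer audio.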

Taxonomy

Topics: Speech Recognition and Synthesis · Multimodal Machine Learning Applications · Speech and Dialogue Systems