Evaluating context-invariance in unsupervised speech representations
Mark Hallap, Emmanuel Dupoux, Ewan Dunbar

TL;DR
This paper introduces a new benchmark to measure context-invariance in unsupervised speech representations, revealing its importance for stable word-level encoding and guiding future research directions.
Contribution
The paper develops a new version of the ZeroSpeech ABX benchmark that measures context-invariance in speech representations.
Findings
Context-invariance correlates with word-level stability.
Current models vary significantly in context-invariance.
Improving context-invariance could enhance language understanding.
Abstract
Unsupervised speech representations have taken off, with benchmarks (SUPERB, ZeroSpeech) demonstrating major progress on semi-supervised speech recognition, speech synthesis, and speech-only language modelling. Inspiration comes from the promise of "discovering the phonemes" of a language or a similar low-bitrate encoding. However, one of the critical properties of phoneme transcriptions is context-invariance: the phonetic context of a speech sound can have a massive influence on the way it is pronounced, while the text remains stable. This is what allows tokens of the same word to have the same transcriptions, which is key to language understanding. Current benchmarks do not measure context-invariance. We develop a new version of the ZeroSpeech ABX benchmark that measures context-invariance, and apply it to recent self-supervised representations. We demonstrate that the context-independence…
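The abstract does not spell out how the benchmark is computed. As a rough illustration of the general ABX idea only, the Python sketch below scores a toy representation under two conditions: one where A, B and X all share a phonetic context, and one where X comes from a different context than A and B. Everything specific here is an assumption rather than the paper's method: each token is reduced to a single fixed-length vector, cosine distance is used, and the names `cosine_distance`, `abx_error`, `unit` and the toy 2-D data are hypothetical.

```python
import numpy as np


def cosine_distance(u, v):
    """Return 1 minus the cosine similarity of two vectors."""
    return 1.0 - float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def abx_error(tokens, across_context):
    """Mean ABX error over all valid (A, B, X) triples.

    tokens: list of (phoneme, context, vector) tuples (hypothetical layout).
    A and X share a phoneme; B has a different phoneme and the same
    context as A. If across_context is False, X shares that context too;
    if True, X comes from a different context.
    """
    scores = []
    for ix, (px, cx, vx) in enumerate(tokens):
        for ia, (pa, ca, va) in enumerate(tokens):
            if ia == ix or pa != px:
                continue  # A must be a different token of X's phoneme
            if across_context == (ca == cx):
                continue  # enforce the requested context condition
            for pb, cb, vb in tokens:
                if pb == px or cb != ca:
                    continue  # B: other phoneme, same context as A
                d_ax = cosine_distance(va, vx)
                d_bx = cosine_distance(vb, vx)
                # Error if X is closer to B than to A; ties count half.
                scores.append(1.0 if d_ax > d_bx else 0.5 if d_ax == d_bx else 0.0)
    return float(np.mean(scores))


def unit(deg):
    """2-D unit vector at the given angle in degrees."""
    rad = np.deg2rad(deg)
    return np.array([np.cos(rad), np.sin(rad)])


# Toy data (invented): context "c2" rotates both phonemes by 50 degrees,
# so "a" in c2 ends up close to "i" in c1.
angles = {("a", "c1"): 0, ("i", "c1"): 60, ("a", "c2"): 50, ("i", "c2"): 110}
tokens = [(p, c, unit(deg + jitter))
          for (p, c), deg in angles.items()
          for jitter in (0, 1)]  # two tokens per (phoneme, context) cell

within = abx_error(tokens, across_context=False)
across = abx_error(tokens, across_context=True)
print("within-context error:", within)
print("across-context error:", across)
```

In this toy setup the representation separates the two phonemes perfectly when context is held fixed, but makes errors once X is drawn from another context. A gap of that kind is what a context-invariance measure is meant to expose. A real evaluation would work on frame sequences from actual speech rather than on single hand-built vectors.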
Taxonomy
Topics: Speech Recognition and Synthesis · Topic Modeling · Natural Language Processing Techniques
