Learning An Invariant Speech Representation
Georgios Evangelopoulos, Stephen Voinea, Chiyuan Zhang, Lorenzo Rosasco, Tomaso Poggio

TL;DR
This paper introduces an invariant speech representation that improves phoneme classification accuracy and reduces sample complexity by learning features robust to intraclass transformations, extending a theory of invariant visual representations to the auditory domain.
Contribution
It extends a theory of invariant visual representations to speech, proposing a template-based, quasi-invariant feature extraction approach for small-sample speech recognition.
Findings
Improved vowel classification accuracy.
Reduced sample complexity compared to standard features.
Effective hierarchical architecture extension.
Abstract
Recognition of speech, and in particular the ability to generalize and learn from small sets of labelled examples like humans do, depends on an appropriate representation of the acoustic input. We formulate the problem of finding robust speech features for supervised learning with small sample complexity as a problem of learning representations of the signal that are maximally invariant to intraclass transformations and deformations. We propose an extension of a theory for unsupervised learning of invariant visual representations to the auditory domain and empirically evaluate its validity for voiced speech sound classification. Our version of the theory requires the memory-based, unsupervised storage of acoustic templates -- such as specific phones or words -- together with all the transformations of each that normally occur. A quasi-invariant representation for a speech segment can be…
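The abstract is cut off before it states how the quasi-invariant representation is computed, so the sketch below is only an illustration of the general template-and-transformations idea, not the authors' implementation. It assumes that each stored template comes with a set of transformed copies, that a signal is compared to every copy by a normalized dot product, and that the resulting values are pooled into a histogram. The function names (`orbit_signature`, `invariant_features`, `cyclic_shift_orbit`), the use of cyclic shifts as the transformation, and the bin count are all hypothetical choices made for the example.

```python
import numpy as np

def orbit_signature(x, orbit, n_bins=8):
    """Histogram of normalized projections of x onto one template's transformed copies."""
    x = x / (np.linalg.norm(x) + 1e-12)
    orbit = orbit / (np.linalg.norm(orbit, axis=1, keepdims=True) + 1e-12)
    proj = orbit @ x  # one value per transformed template, each in [-1, 1]
    hist, _ = np.histogram(proj, bins=n_bins, range=(-1.0, 1.0))
    return hist / len(proj)  # empirical distribution over the projections

def invariant_features(x, orbits, n_bins=8):
    """Concatenate the per-template signatures into one feature vector."""
    return np.concatenate([orbit_signature(x, o, n_bins) for o in orbits])

def cyclic_shift_orbit(template):
    """Toy transformation set (an assumption): all cyclic shifts of a template."""
    return np.stack([np.roll(template, k) for k in range(len(template))])

# Random vectors stand in for stored acoustic templates and an input segment.
rng = np.random.default_rng(0)
templates = rng.standard_normal((3, 16))
orbits = [cyclic_shift_orbit(t) for t in templates]

x = rng.standard_normal(16)
f_x = invariant_features(x, orbits)
f_shifted = invariant_features(np.roll(x, 5), orbits)
```

Because shifting the input only permutes its projections onto the full set of shifted templates, `f_x` and `f_shifted` agree in this toy setting. Real speech variability is not a simple cyclic shift, which is why the abstract speaks of a quasi-invariant representation.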
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Speech Recognition and Synthesis
