Cognitive Coding of Speech
Reza Lotfidereshgi, Philippe Gournay

TL;DR
This paper introduces a hierarchical neural network approach for unsupervised cognitive coding of speech, capturing different speech attributes at multiple time scales, with applications in speech compression.
Contribution
It presents a novel two-stage neural network model that hierarchically encodes speech attributes at different temporal resolutions, improving predictive capability and compression performance.
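The two-stage idea can be illustrated with a minimal NumPy sketch of the shape bookkeeping only. Everything in it is an assumption made for illustration and is not taken from the paper: the function names, the random linear-plus-tanh layers, the average pooling, and the stride of 10 frames. The predictive training objective is omitted entirely.

```python
import numpy as np

def lower_stage(frames, w_low):
    """Map each input frame to a short-time-scale latent vector (hypothetical layer)."""
    return np.tanh(frames @ w_low)          # (T, d_in) -> (T, d_low)

def upper_stage(z_low, w_up, stride):
    """Summarize every `stride` lower-level vectors into one long-time-scale vector."""
    t = (z_low.shape[0] // stride) * stride  # drop any incomplete trailing group
    pooled = z_low[:t].reshape(-1, stride, z_low.shape[1]).mean(axis=1)
    return np.tanh(pooled @ w_up)           # (T // stride, d_up)

def top_down(z_low, z_up, w_td, stride):
    """Feed upper-level context back to the lower-level frames it summarizes."""
    t = z_up.shape[0] * stride
    context = np.repeat(z_up, stride, axis=0)  # (t, d_up), one copy per frame
    return z_low[:t] + context @ w_td          # (t, d_low)

# Usage with made-up sizes: 100 frames, stride 10.
rng = np.random.default_rng(0)
frames = rng.standard_normal((100, 16))
z_low = lower_stage(frames, rng.standard_normal((16, 8)))           # shape (100, 8)
z_up = upper_stage(z_low, rng.standard_normal((8, 4)), stride=10)   # shape (10, 4)
fused = top_down(z_low, z_up, rng.standard_normal((4, 8)), stride=10)  # shape (100, 8)
```

The point of the sketch is that the upper stage produces one vector per group of lower-level frames, so slowly varying attributes have a coarser representation, while the top-down step makes that context available again at the fine time scale.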
Findings
Reported performance on the LibriSpeech and EmoV-DB datasets exceeds the state of the art.
The extracted representations capture phoneme, speaker, and emotion attributes.
The representations remain robust under dimensionality reduction and low-bitrate quantization.
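To make the compression finding concrete, the following sketch shows one generic way to reduce and quantize a sequence of representation vectors and to compute the resulting bitrate. It is not the paper's method: PCA via SVD, uniform scalar quantization, and all names and numbers here are assumptions chosen for illustration.

```python
import numpy as np

def pca_reduce(z, k):
    """Project vectors onto their top-k principal directions (assumed reduction)."""
    mean = z.mean(axis=0)
    centered = z - mean
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:k]                          # (k, d)
    return centered @ basis.T, basis, mean  # reduced has shape (T, k)

def uniform_quantize(x, bits):
    """Uniform scalar quantization of all values to 2**bits levels."""
    levels = 2 ** bits
    lo, hi = float(x.min()), float(x.max())
    step = (hi - lo) / (levels - 1)
    if step == 0.0:                         # constant input: any step works
        step = 1.0
    idx = np.round((x - lo) / step).astype(int)
    return idx, lo, step

def dequantize(idx, lo, step):
    """Map quantization indices back to reconstructed values."""
    return lo + idx * step

def bitrate(dims, bits, vectors_per_second):
    """Bits per second for fixed-length coding of each vector component."""
    return dims * bits * vectors_per_second
```

For example, under these assumptions a stream of 100 vectors per second with 4 dimensions at 3 bits each costs `bitrate(4, 3, 100)`, which is 1200 bits per second. These figures are hypothetical and are not rates reported by the authors.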
Abstract
We propose an approach for cognitive coding of speech by unsupervised extraction of contextual representations in two hierarchical levels of abstraction. Speech attributes such as phoneme identity that last one hundred milliseconds or less are captured in the lower level of abstraction, while speech attributes such as speaker identity and emotion that persist up to one second are captured in the higher level of abstraction. This decomposition is achieved by a two-stage neural network, with a lower and an upper stage operating at different time scales. Both stages are trained to predict the content of the signal in their respective latent spaces. A top-down pathway between stages further improves the predictive capability of the network. With an application in speech compression in mind, we investigate the effect of dimensionality reduction and low bitrate quantization on the extracted…
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Advanced Data Compression Techniques
