Composition of Deep and Spiking Neural Networks for Very Low Bit Rate Speech Coding
Milos Cernak, Alexandros Lazaridis, Afsaneh Asaei, Philip N. Garner

TL;DR
This paper introduces a novel low bit rate speech coding system using deep and spiking neural networks that improves speech quality and reduces artifacts compared to traditional HMM-based methods.
Contribution
It presents an end-to-end neural network framework for speech coding that eliminates HMMs and uses phonological representations for smoother, more natural speech synthesis.
Findings
Listeners preferred the NN-based speech over the HMM-based baseline due to fewer artifacts.
The system operates at approximately 360 bits/sec.
A single forward pass enables efficient encoding and decoding.
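To put the reported rate of roughly 360 bits/sec in perspective, the arithmetic below is a minimal sketch that converts a constant bit rate into payload size. The helper `payload_bytes` and the constant names are illustrative and do not come from the paper. The 64 kbit/s figure is the standard G.711 telephony rate, included only as a familiar point of comparison, not as a baseline the authors evaluated.

```python
def payload_bytes(bit_rate_bps: float, seconds: float) -> float:
    """Bytes needed to carry `seconds` of speech at a constant bit rate."""
    return bit_rate_bps * seconds / 8


VLBR_BPS = 360       # approximate rate reported in the findings above
G711_BPS = 64_000    # G.711 PCM telephony rate (comparison point, an assumption)

one_minute_vlbr = payload_bytes(VLBR_BPS, 60)
one_minute_g711 = payload_bytes(G711_BPS, 60)
ratio = G711_BPS / VLBR_BPS

print(f"1 min at {VLBR_BPS} bit/s: {one_minute_vlbr:.0f} bytes")
print(f"1 min at {G711_BPS} bit/s: {one_minute_g711:.0f} bytes")
print(f"Rate ratio: about {ratio:.0f}x")
```

Under these assumptions, one minute of coded speech fits in under 3 kB, which is why rates in this range are described as very low.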
Abstract
Most current very low bit rate (VLBR) speech coding systems use hidden Markov model (HMM) based speech recognition/synthesis techniques. This allows transmission of information (such as phonemes) segment by segment that decreases the bit rate. However, the encoder based on a phoneme speech recognition may create bursts of segmental errors. Segmental errors are further propagated to optional suprasegmental (such as syllable) information coding. Together with the errors of voicing detection in pitch parametrization, HMM-based speech coding creates speech discontinuities and unnatural speech sound artefacts. In this paper, we propose a novel VLBR speech coding framework based on neural networks (NNs) for end-to-end speech analysis and synthesis without HMMs. The speech coding framework relies on phonological (sub-phonetic) representation of speech, and it is designed as a composition of…
Taxonomy
Topics: Advanced Data Compression Techniques · Speech and Audio Processing · Speech Recognition and Synthesis
