End-to-End Optimized Speech Coding with Deep Neural Networks
Srihari Kankanahalli

TL;DR
This paper introduces a deep neural network that optimizes the entire wideband speech coding pipeline end-to-end. The resulting coder performs on par with the AMR-WB standard, trains in hours, and runs in real time on a CPU.
Contribution
It presents the first end-to-end neural speech coder that learns every pipeline step (compression, quantization, entropy coding, and decompression) directly from raw speech, with no manual feature engineering, and performs on par with the AMR-WB standard.
Findings
Performs on par with AMR-WB at bitrates from roughly 9 kbps to 24 kbps
Trains in hours, whereas hand-designed standards took years to develop
Runs in real time on a 3.8 GHz Intel CPU
Abstract
Modern compression algorithms are often the result of laborious domain-specific research; industry standards such as MP3, JPEG, and AMR-WB took years to develop and were largely hand-designed. We present a deep neural network model which optimizes all the steps of a wideband speech coding pipeline (compression, quantization, entropy coding, and decompression) end-to-end directly from raw speech data -- no manual feature engineering necessary, and it trains in hours. In testing, our DNN-based coder performs on par with the AMR-WB standard at a variety of bitrates (~9 kbps up to ~24 kbps). It also runs in real time on a 3.8 GHz Intel CPU.
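To make the quantization and entropy coding stages named in the abstract more concrete, here is a minimal sketch of how the bitrate of a quantized code stream could be estimated. It is not the paper's implementation: the uniform scalar quantizer, the function names, and the symbol rate of 8000 symbols per second are all assumptions chosen for illustration, and the learned encoder and decoder networks are left out entirely.

```python
import math
from collections import Counter


def quantize(values, levels):
    """Map each value in [-1, 1] to the index of the nearest of `levels`
    evenly spaced bins (hypothetical uniform quantizer, not the paper's)."""
    step = 2.0 / (levels - 1)
    return [min(levels - 1, max(0, round((v + 1.0) / step))) for v in values]


def dequantize(symbols, levels):
    """Map bin indices back to their bin centers in [-1, 1]."""
    step = 2.0 / (levels - 1)
    return [s * step - 1.0 for s in symbols]


def entropy_bits(symbols):
    """Empirical Shannon entropy in bits per symbol. An ideal entropy
    coder would approach this average code length."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def bitrate_kbps(symbols, symbols_per_second):
    """Estimated bitrate: entropy per symbol times symbols per second."""
    return entropy_bits(symbols) * symbols_per_second / 1000.0


if __name__ == "__main__":
    # Assumed code stream: four equally likely symbols.
    code = [0, 1, 2, 3] * 10
    print(entropy_bits(code))        # 2.0 bits per symbol
    print(bitrate_kbps(code, 8000))  # 16.0 kbps at the assumed symbol rate
```

With four equally likely symbols the entropy is 2 bits per symbol, so the assumed rate of 8000 symbols per second gives 16 kbps, which falls inside the ~9 kbps to ~24 kbps range the abstract reports. A less uniform symbol distribution would lower the entropy and therefore the estimated bitrate.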
