End-to-End Optimized Speech Coding with Deep Neural Networks

Srihari Kankanahalli

arXiv:1710.09064 · cs.SD · July 9, 2021

TL;DR

This paper introduces a deep neural network that optimizes an entire wideband speech coding pipeline end-to-end from raw speech data. The resulting coder performs on par with the AMR-WB standard at roughly 9 to 24 kbps, trains in hours, and runs in real time on a CPU.

Contribution

It presents an end-to-end neural speech coder that learns every pipeline step (compression, quantization, entropy coding, and decompression) directly from raw speech, with no manual feature engineering, and performs on par with the AMR-WB standard.

Findings

01 Performs on par with AMR-WB at bitrates from ~9 kbps to ~24 kbps

02 Trains in hours, whereas hand-designed standards took years to develop

03 Runs in real time on a 3.8 GHz Intel CPU

Abstract

Modern compression algorithms are often the result of laborious domain-specific research; industry standards such as MP3, JPEG, and AMR-WB took years to develop and were largely hand-designed. We present a deep neural network model that optimizes all the steps of a wideband speech coding pipeline (compression, quantization, entropy coding, and decompression) end-to-end directly from raw speech data -- no manual feature engineering necessary, and it trains in hours. In testing, our DNN-based coder performs on par with the AMR-WB standard at a variety of bitrates (~9 kbps up to ~24 kbps). It also runs in real time on a 3.8 GHz Intel CPU.
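To make the quantization, entropy coding, and decompression stages named in the abstract concrete, here is a minimal sketch of a toy coding pipeline in Python. It is not the paper's network: it assumes a fixed uniform scalar quantizer in place of the learned compression stage, and it estimates the bitrate from the empirical entropy of the quantized symbols rather than running an actual entropy coder. All function names are hypothetical and chosen for illustration only.

```python
import math
from collections import Counter


def quantize(samples, levels):
    """Map samples in [-1, 1] to integer bin indices in 0..levels-1 (uniform bins)."""
    indices = []
    for x in samples:
        x = max(-1.0, min(1.0, x))  # clip out-of-range samples
        indices.append(int(round((x + 1.0) / 2.0 * (levels - 1))))
    return indices


def dequantize(indices, levels):
    """Map bin indices back to reconstruction values in [-1, 1]."""
    return [i / (levels - 1) * 2.0 - 1.0 for i in indices]


def entropy_bits(indices):
    """Empirical entropy of the symbol stream, in bits per symbol."""
    counts = Counter(indices)
    n = len(indices)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def estimated_kbps(indices, symbols_per_second):
    """Bitrate an ideal entropy coder would approach for this symbol stream."""
    return entropy_bits(indices) * symbols_per_second / 1000.0


if __name__ == "__main__":
    # Assumption: a 440 Hz test tone sampled at 16 kHz stands in for speech.
    rate = 16000
    signal = [0.8 * math.sin(2 * math.pi * 440 * t / rate) for t in range(rate)]
    symbols = quantize(signal, levels=32)
    decoded = dequantize(symbols, levels=32)
    worst = max(abs(a - b) for a, b in zip(signal, decoded))
    print(f"estimated bitrate: {estimated_kbps(symbols, rate):.1f} kbps")
    print(f"worst-case reconstruction error: {worst:.4f}")
```

In this toy version, fewer quantization levels lower the estimated bitrate and raise the reconstruction error. An end-to-end coder of the kind the abstract describes would learn that trade-off from data instead of fixing it by hand.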
