DRED: Deep REDundancy Coding of Speech Using a Rate-Distortion-Optimized Variational Autoencoder
Jean-Marc Valin, Jan Büthe, Ahmed Mustafa, Michael Klingbeil

TL;DR
This paper introduces DRED, a novel deep speech redundancy coding method using a rate-distortion-optimized variational autoencoder, enabling efficient transmission of large redundancy at low bitrates to improve packet loss recovery.
Contribution
The paper presents an RDO-VAE-based approach to deep speech redundancy coding that encodes large amounts of redundancy at very low bitrate and outperforms the existing Opus codec redundancy under packet loss.
Findings
DRED transmits up to 50x redundancy using less than 32 kb/s.
DRED outperforms the existing Opus codec redundancy.
DRED's benefits are also demonstrated in the context of WebRTC.
Abstract
Despite recent advancements in packet loss concealment (PLC) using deep learning techniques, packet loss remains a significant challenge in real-time speech communication. Redundancy has been used in the past to recover the missing information during losses. However, conventional redundancy techniques are limited in the maximum loss duration they can cover and are often unsuitable for burst packet loss. We propose a new approach based on a rate-distortion-optimized variational autoencoder (RDO-VAE), allowing us to optimize a deep speech compression algorithm for the task of encoding large amounts of redundancy at very low bitrate. The proposed Deep REDundancy (DRED) algorithm can transmit up to 50x redundancy using less than 32 kb/s. Results show that DRED outperforms the existing Opus codec redundancy. We also demonstrate its benefits when operating in the context of WebRTC.
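To make the headline figures concrete, here is a rough back-of-the-envelope sketch of what "50x redundancy using less than 32 kb/s" could imply. The abstract does not state the packet duration, so the 20 ms value below is an assumption (a common frame size in real-time speech), and the function name `redundancy_budget` is hypothetical. The sketch also treats the full 32 kb/s as redundancy budget, so the per-copy figure is only an upper bound.

```python
def redundancy_budget(total_bps: float, factor: int, packet_ms: float):
    """Rough budget for redundancy carried in each packet.

    total_bps : bitrate ceiling taken from the abstract (32 kb/s)
    factor    : redundancy factor from the abstract (50x)
    packet_ms : packet duration in ms (ASSUMPTION, not given in the abstract)
    """
    # Span of past audio that one packet's redundancy can cover.
    coverage_s = factor * packet_ms / 1000.0
    # Upper bound on the average bitrate available to each redundant copy.
    per_copy_bps = total_bps / factor
    # Upper bound on the number of bits carried by one packet.
    bits_per_packet = total_bps * packet_ms / 1000.0
    return coverage_s, per_copy_bps, bits_per_packet


coverage_s, per_copy_bps, bits_per_packet = redundancy_budget(32_000, 50, 20.0)
print(f"coverage per packet: {coverage_s} s")
print(f"average bitrate per redundant copy: at most {per_copy_bps} b/s")
print(f"bits per packet: at most {bits_per_packet}")
```

Under these assumptions, each packet would carry enough redundancy to cover about one second of earlier audio, with each redundant copy averaging no more than a few hundred bits per second. That is why the abstract frames the problem as optimizing a compression algorithm specifically for very low bitrate.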
Taxonomy
Topics: Speech and Audio Processing · Advanced Data Compression Techniques · Speech Recognition and Synthesis
