The 2020 ESPnet update: new features, broadened applications, performance improvements, and future plans

Shinji Watanabe; Florian Boyer; Xuankai Chang; Pengcheng Guo; Tomoki Hayashi; Yosuke Higuchi; Takaaki Hori; Wen-Chin Huang; Hirofumi Inaguma; Naoyuki Kamo; Shigeki Karita; Chenda Li; Jing Shi; Aswin Shanmugam Subramanian; Wangyou Zhang

arXiv:2012.13006 · eess.AS · December 25, 2020 · 6 cites

Open Access

TL;DR

The 2020 ESPnet update introduces new features, broadens the toolkit's scope to include text-to-speech (TTS), voice conversion (VC), speech translation (ST), and speech enhancement (SE), and reports state-of-the-art performance with improved models and recipes for end-to-end speech processing.

Contribution

This paper details the latest developments of ESPnet, expanding its applications and enhancing performance with new models, data augmentation, and comprehensive recipes.

Findings

01

Supports multiple speech processing tasks in a unified framework

02

Achieves state-of-the-art results on various benchmarks

03

Provides reproducible recipes for community use

Abstract

This paper describes the recent development of ESPnet (https://github.com/espnet/espnet), an end-to-end speech processing toolkit. The project was initiated in December 2017, mainly to handle end-to-end speech recognition experiments based on sequence-to-sequence modeling. It has grown rapidly and now covers a wide range of speech processing applications: ESPnet also includes text-to-speech (TTS), voice conversion (VC), speech translation (ST), and speech enhancement (SE) with support for beamforming, speech separation, denoising, and dereverberation. All applications are trained in an end-to-end manner, thanks to the generic sequence-to-sequence modeling properties, and they can be further integrated and jointly optimized. ESPnet also provides reproducible all-in-one recipes for these applications with state-of-the-art performance in various benchmarks by…

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing