Recent Developments on ESPnet Toolkit Boosted by Conformer

Pengcheng Guo; Florian Boyer; Xuankai Chang; Tomoki Hayashi; Yosuke Higuchi; Hirofumi Inaguma; Naoyuki Kamo; Chenda Li; Daniel Garcia-Romero; Jiatong Shi; Jing Shi; Shinji Watanabe; Kun Wei; Wangyou Zhang; Yuekai Zhang

arXiv:2010.13956·eess.AS·October 30, 2020·40 cites

TL;DR

This paper describes recent enhancements to the ESPnet toolkit, centered on integrating the Conformer architecture across speech recognition, translation, separation, and synthesis tasks. It reports performance gains and aims to reduce the effort needed to reproduce such systems.

Contribution

Integrates the Conformer architecture into ESPnet and demonstrates its effectiveness across multiple speech tasks, with open-source recipes and pre-trained models.

Findings

01 Conformer achieves competitive or superior results compared to state-of-the-art Transformers.

02 Training tips significantly improve Conformer performance.

03 Open-source recipes facilitate research and reduce resource barriers.

Abstract

In this study, we present recent developments on ESPnet, the End-to-End Speech Processing toolkit, which mainly involve a recently proposed architecture called Conformer (Convolution-augmented Transformer). This paper shows results for a wide range of end-to-end speech processing applications, such as automatic speech recognition (ASR), speech translation (ST), speech separation (SS), and text-to-speech (TTS). Our experiments reveal various training tips and significant performance benefits obtained with the Conformer on different tasks. These results are competitive with, or even outperform, the current state-of-the-art Transformer models. We are preparing to release all-in-one recipes using open source and publicly available corpora for all the above tasks with pre-trained models. Our aim for this work is to contribute to our research community by reducing the burden of preparing…

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing