Recent Developments on ESPnet Toolkit Boosted by Conformer
Pengcheng Guo, Florian Boyer, Xuankai Chang, Tomoki Hayashi, Yosuke Higuchi, Hirofumi Inaguma, Naoyuki Kamo, Chenda Li, Daniel Garcia-Romero, Jiatong Shi, Jing Shi, Shinji Watanabe, Kun Wei, Wangyou Zhang, Yuekai Zhang

TL;DR
This paper discusses recent enhancements to the ESPnet toolkit, highlighting the integration of the Conformer architecture across various speech processing tasks, leading to improved performance and resource-efficient research.
Contribution
Integrates the Conformer architecture into ESPnet and demonstrates its effectiveness across multiple speech tasks, with open-source recipes and pre-trained models.
Findings
Conformer achieves results competitive with or superior to state-of-the-art Transformer models.
Training tips significantly improve Conformer performance.
Open-source recipes facilitate research and reduce resource barriers.
Abstract
In this study, we present recent developments on ESPnet: End-to-End Speech Processing toolkit, which mainly involve a recently proposed architecture called Conformer, the Convolution-augmented Transformer. This paper shows results for a wide range of end-to-end speech processing applications, such as automatic speech recognition (ASR), speech translation (ST), speech separation (SS), and text-to-speech (TTS). Our experiments reveal various training tips and significant performance benefits obtained with the Conformer on different tasks. These results are competitive with or even outperform the current state-of-the-art Transformer models. We are preparing to release all-in-one recipes using open source and publicly available corpora for all the above tasks with pre-trained models. Our aim for this work is to contribute to our research community by reducing the burden of preparing…
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
