ESPnet: End-to-End Speech Processing Toolkit
Shinji Watanabe, Takaaki Hori, Shigeki Karita, Tomoki Hayashi, Jiro Nishitoba, Yuya Unno, Nelson Enrique Yalta Soplin, Jahn Heymann, Matthew Wiesner, Nanxin Chen, Adithya Renduchintala, Tsubasa Ochiai

TL;DR
ESPnet is an open-source toolkit for end-to-end speech processing, focused mainly on automatic speech recognition (ASR). It uses Chainer and PyTorch as its deep learning engines and follows Kaldi-style data processing, feature extraction, and recipes.
Contribution
It introduces a unified, flexible platform combining Chainer/PyTorch with Kaldi-style workflows for end-to-end speech recognition.
Findings
Achieves competitive results on major ASR benchmarks.
Provides a versatile toolkit supporting various speech processing tasks.
Demonstrates ease of use and extensibility for research.
Abstract
This paper introduces a new open source platform for end-to-end speech processing named ESPnet. ESPnet mainly focuses on end-to-end automatic speech recognition (ASR), and adopts widely-used dynamic neural network toolkits, Chainer and PyTorch, as a main deep learning engine. ESPnet also follows the Kaldi ASR toolkit style for data processing, feature extraction/format, and recipes to provide a complete setup for speech recognition and other speech processing experiments. This paper explains a major architecture of this software platform, several important functionalities, which differentiate ESPnet from other open source ASR toolkits, and experimental results with major ASR benchmarks.
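As a rough illustration of what "Kaldi-style data processing" refers to in the abstract, the sketch below reads a Kaldi-style data directory, where plain-text tables such as `wav.scp`, `text`, and `utt2spk` map an utterance ID to an audio path, a transcript, and a speaker ID. The helper names (`read_kaldi_table`, `load_data_dir`, `demo`) and the sample utterances are hypothetical and are not part of ESPnet or Kaldi; this is a minimal sketch of the file convention, not the toolkit's own loader.

```python
import tempfile
from pathlib import Path


def read_kaldi_table(path):
    """Parse a Kaldi-style text table: one '<key> <value...>' entry per line."""
    table = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            # The first whitespace-separated token is the key (utterance ID).
            key, _, value = line.partition(" ")
            table[key] = value.strip()
    return table


def load_data_dir(data_dir):
    """Join wav.scp, text, and utt2spk on utterance ID (hypothetical helper)."""
    data_dir = Path(data_dir)
    wavs = read_kaldi_table(data_dir / "wav.scp")
    texts = read_kaldi_table(data_dir / "text")
    spks = read_kaldi_table(data_dir / "utt2spk")
    return {
        utt: {
            "wav": wavs[utt],
            "text": texts.get(utt, ""),
            "speaker": spks.get(utt, ""),
        }
        for utt in sorted(wavs)
    }


def demo():
    """Build a tiny made-up data directory and load it back."""
    with tempfile.TemporaryDirectory() as tmp:
        d = Path(tmp)
        (d / "wav.scp").write_text(
            "utt1 /audio/utt1.wav\nutt2 /audio/utt2.wav\n", encoding="utf-8"
        )
        (d / "text").write_text(
            "utt1 HELLO WORLD\nutt2 GOOD MORNING\n", encoding="utf-8"
        )
        (d / "utt2spk").write_text("utt1 spkA\nutt2 spkB\n", encoding="utf-8")
        return load_data_dir(d)


if __name__ == "__main__":
    for utt, entry in demo().items():
        print(utt, entry)
```

Keeping each table as a separate file keyed by utterance ID is what lets recipe stages be chained with simple text tools; the join above only shows how the pieces line up.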
Code & Models
- 🤗 GunnarThor/talromur_f_tacotron2 (model, 4 downloads)
- 🤗 byan/librispeech_asr_train_asr_conformer_raw_bpe_batch_bins30000000_accum_grad3_optim_conflr0.001_sp (model, 7 downloads)
- 🤗 byan/librispeech_asr_train_asr_transformer_e18_raw_bpe_sp (model, 5 downloads)
- 🤗 eml914/streaming_transformer_asr_librispeech (model, 1 download)
- 🤗 espnet/Chenda_Li_wsj0_2mix_enh_train_enh_conv_tasnet_raw_valid.si_snr.ave (model, 3 downloads)
- 🤗 espnet/Chenda_Li_wsj0_2mix_enh_train_enh_rnn_tf_raw_valid.si_snr.ave (model, 4 downloads)
- 🤗 espnet/Emiru_Tsunoo_aishell_asr_train_asr_streaming_transformer_raw_zh_char_sp_valid.acc.ave (model, 3 downloads)
- 🤗 espnet/Hoon_Chung_jsut_asr_train_asr_conformer8_raw_char_sp_valid.acc.ave (model, 1 download, ♡ 1)
- 🤗 espnet/Hoon_Chung_zeroth_korean_asr_train_asr_transformer5_raw_bpe_valid.acc.ave (model, 2 downloads)
- 🤗 espnet/Karthik_DSTC2_asr_train_asr_Hubert_transformer (model, 7 downloads)
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Natural Language Processing Techniques
Methods: Hierarchical Feature Fusion · Dilated Convolution · Pointwise Convolution · Convolution · Efficient Spatial Pyramid · Kaiming Initialization · 1x1 Convolution · Parameterized ReLU · ESPNet
