ESPnet: End-to-End Speech Processing Toolkit

Shinji Watanabe; Takaaki Hori; Shigeki Karita; Tomoki Hayashi; Jiro Nishitoba; Yuya Unno; Nelson Enrique Yalta Soplin; Jahn Heymann; Matthew Wiesner; Nanxin Chen; Adithya Renduchintala; Tsubasa Ochiai

arXiv:1804.00015 · cs.CL · April 3, 2018 · 74 cites

TL;DR

ESPnet is an open-source toolkit for end-to-end speech recognition that integrates neural network frameworks with Kaldi-style data processing, offering a comprehensive platform for speech processing research and development.

Contribution

It introduces a unified, flexible platform combining Chainer/PyTorch with Kaldi-style workflows for end-to-end speech recognition.

Findings

01. Achieved competitive results on major ASR benchmarks.

02. Provides a versatile toolkit supporting various speech processing tasks.

03. Demonstrates ease of use and extensibility for research.

Abstract

This paper introduces a new open source platform for end-to-end speech processing named ESPnet. ESPnet mainly focuses on end-to-end automatic speech recognition (ASR), and adopts widely-used dynamic neural network toolkits, Chainer and PyTorch, as a main deep learning engine. ESPnet also follows the Kaldi ASR toolkit style for data processing, feature extraction/format, and recipes to provide a complete setup for speech recognition and other speech processing experiments. This paper explains a major architecture of this software platform, several important functionalities, which differentiate ESPnet from other open source ASR toolkits, and experimental results with major ASR benchmarks.
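The abstract says ESPnet follows the Kaldi style for data processing and formats. As a rough illustration of what that style looks like, the sketch below parses the plain-text tables found in a Kaldi-style data directory (`wav.scp`, `text`, `utt2spk`), where each line is an utterance ID followed by a value. The function names, utterance IDs, paths, and transcripts are hypothetical and chosen only for this example; this is not code from ESPnet or Kaldi.

```python
from collections import defaultdict


def parse_kaldi_table(lines):
    """Parse 'key value...' lines into a dict mapping key -> rest of line.

    This is the line layout used by Kaldi-style files such as wav.scp,
    text, and utt2spk. Blank lines are skipped.
    """
    table = {}
    for line in lines:
        parts = line.strip().split(None, 1)  # split on first whitespace run
        if not parts:
            continue
        key = parts[0]
        value = parts[1].strip() if len(parts) > 1 else ""
        table[key] = value
    return table


def invert_utt2spk(utt2spk):
    """Build a speaker -> sorted utterance list map (the role of spk2utt)."""
    spk2utt = defaultdict(list)
    for utt, spk in utt2spk.items():
        spk2utt[spk].append(utt)
    return {spk: sorted(utts) for spk, utts in spk2utt.items()}


# Hypothetical contents of a tiny data directory, for illustration only.
wav_scp = parse_kaldi_table([
    "spk1_utt1 /data/a.wav",
    "spk1_utt2 /data/b.wav",
    "spk2_utt1 /data/c.wav",
])
text = parse_kaldi_table([
    "spk1_utt1 HELLO WORLD",
    "spk1_utt2 GOOD MORNING",
    "spk2_utt1 THANK YOU",
])
utt2spk = parse_kaldi_table([
    "spk1_utt1 spk1",
    "spk1_utt2 spk1",
    "spk2_utt1 spk2",
])
spk2utt = invert_utt2spk(utt2spk)

# Every utterance should have audio, a transcript, and a speaker label.
consistent = set(wav_scp) == set(text) == set(utt2spk)
```

Keeping audio paths, transcripts, and speaker labels in separate tables keyed by the same utterance ID is what lets a recipe check a data directory for consistency before feature extraction, as the last line of the sketch does.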

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Natural Language Processing Techniques
