E-Branchformer: Branchformer with Enhanced merging for speech recognition

Kwangyoun Kim; Felix Wu; Yifan Peng; Jing Pan; Prashant Sridhar; Kyu J. Han; Shinji Watanabe

arXiv:2210.00077 · eess.AS · October 18, 2022 · 6 cites

TL;DR

E-Branchformer improves upon Branchformer by using a more effective merging method and stacking additional point-wise modules, achieving state-of-the-art word error rates on the LibriSpeech test-clean and test-other sets without external training data.

Contribution

The paper introduces E-Branchformer, a model that enhances Branchformer with a more effective method for merging its convolution and self-attention branches and with additional point-wise modules, improving ASR accuracy.

Findings

01 Achieves new state-of-the-art WERs of 1.81% and 3.65% on the LibriSpeech test-clean and test-other sets, respectively.

02 Reaches these results without using any external training data.

03 Demonstrates the effectiveness of enhanced merging and of stacking additional point-wise modules for speech recognition.
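The WER figures above count word-level substitutions, deletions, and insertions relative to the number of reference words. A minimal sketch of that calculation for a single utterance follows. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration, and this is not the scoring pipeline used in the paper. Corpus-level WER is normally computed by summing errors and reference words over all utterances rather than averaging per-utterance rates.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + deletions + insertions) / reference word count.

    Hypothetical helper: words are split on whitespace, with no text
    normalization (casing, punctuation) applied.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One deleted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

A WER of 1.81% therefore corresponds to roughly 18 word errors per 1,000 reference words.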

Abstract

Conformer, combining convolution and self-attention sequentially to capture both local and global information, has shown remarkable performance and is currently regarded as the state-of-the-art for automatic speech recognition (ASR). Several other studies have explored integrating convolution and self-attention, but they have not managed to match Conformer's performance. The recently introduced Branchformer achieves comparable performance to Conformer by using dedicated branches of convolution and self-attention and merging local and global context from each branch. In this paper, we propose E-Branchformer, which enhances Branchformer by applying an effective merging method and stacking additional point-wise modules. E-Branchformer sets new state-of-the-art word error rates (WERs) of 1.81% and 3.65% on the LibriSpeech test-clean and test-other sets without using any external training data.
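To make the two-branch idea in the abstract concrete, here is a toy sketch in pure Python. It is not the E-Branchformer architecture: every function name is hypothetical, the local branch is a fixed smoothing convolution over time, the global branch is scaled dot-product self-attention with identity projections and no learned weights, and the merge is a plain element-wise average standing in for the paper's merging method, which the abstract does not specify. The sketch only shows the data flow in which one input feeds a local and a global branch in parallel and the two outputs are then merged.

```python
import math


def local_branch(x, kernel=(0.25, 0.5, 0.25)):
    """Toy convolution branch: per-channel 1-D convolution over time, zero-padded."""
    num_frames, dim = len(x), len(x[0])
    half = len(kernel) // 2
    out = []
    for t in range(num_frames):
        frame = [0.0] * dim
        for k, weight in enumerate(kernel):
            s = t + k - half  # neighbouring time step covered by this kernel tap
            if 0 <= s < num_frames:
                for d in range(dim):
                    frame[d] += weight * x[s][d]
        out.append(frame)
    return out


def global_branch(x):
    """Toy self-attention branch: every frame attends to all frames (identity Q/K/V)."""
    num_frames, dim = len(x), len(x[0])
    out = []
    for t in range(num_frames):
        scores = [
            sum(x[t][d] * x[s][d] for d in range(dim)) / math.sqrt(dim)
            for s in range(num_frames)
        ]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(score - top) for score in scores]
        total = sum(weights)
        out.append(
            [sum(weights[s] / total * x[s][d] for s in range(num_frames)) for d in range(dim)]
        )
    return out


def merge(local_out, global_out):
    """Placeholder merge: element-wise average of the two branch outputs."""
    return [
        [0.5 * (a + b) for a, b in zip(local_frame, global_frame)]
        for local_frame, global_frame in zip(local_out, global_out)
    ]


def two_branch_block(x):
    """Run both branches on the same input in parallel, then merge them."""
    return merge(local_branch(x), global_branch(x))


if __name__ == "__main__":
    features = [[1.0, 2.0], [0.5, -1.0], [0.0, 3.0], [2.0, 0.5]]  # 4 frames, 2 channels
    for frame in two_branch_block(features):
        print(frame)
```

The output has the same shape as the input (frames by channels), which is what allows blocks of this kind to be stacked. By contrast, the abstract describes Conformer as applying convolution and self-attention sequentially rather than in parallel branches.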

Code & Models

Repositories

espnet/espnet (PyTorch, official)

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing

Methods: E-Branchformer