Branchformer: Parallel MLP-Attention Architectures to Capture Local and Global Context for Speech Recognition and Understanding

Yifan Peng; Siddharth Dalmia; Ian Lane; Shinji Watanabe

arXiv:2207.02971 · cs.CL · July 8, 2022 · 40 cites

TL;DR

Branchformer introduces a parallel branch architecture combining self-attention and convolutional gating MLPs to effectively model local and global dependencies in speech recognition, outperforming existing models like Transformer and Conformer.

Contribution

It proposes a novel parallel branch encoder architecture for speech processing, enhancing flexibility, interpretability, and efficiency over prior models.

Findings

01. Outperforms Transformer and cgMLP on speech benchmarks.

02. Matches or exceeds state-of-the-art Conformer results.

03. Enables variable inference complexity with reduced computation strategies.

Abstract

Conformer has proven to be effective in many speech processing tasks. It combines the benefits of extracting local dependencies using convolutions and global dependencies using self-attention. Inspired by this, we propose a more flexible, interpretable and customizable encoder alternative, Branchformer, with parallel branches for modeling various ranged dependencies in end-to-end speech processing. In each encoder layer, one branch employs self-attention or its variant to capture long-range dependencies, while the other branch utilizes an MLP module with convolutional gating (cgMLP) to extract local relationships. We conduct experiments on several speech recognition and spoken language understanding benchmarks. Results show that our model outperforms both Transformer and cgMLP. It also matches with or outperforms state-of-the-art results achieved by Conformer. Furthermore, we show…
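The two-branch layer described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The abstract only states that one branch uses self-attention and the other an MLP with convolutional gating, so the following details are assumptions made for illustration: single-head attention, a ReLU activation, a depthwise convolution over time as the gate, merging the branches by concatenation plus a linear projection, and all function and parameter names (`attention_branch`, `cgmlp_branch`, `branchformer_layer`, `init_params`, `w_merge`, and so on).

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each frame over its feature dimension (no learned scale/bias).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_branch(x, w_q, w_k, w_v):
    # Global branch: every frame attends to every other frame.
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])      # (T, T)
    return softmax(scores) @ v                    # (T, d_model)

def depthwise_conv1d(x, kernel):
    # Per-channel sliding-window weighted sum over time, zero padded so the
    # output keeps T frames. x: (T, C), kernel: (K, C) with odd K.
    k = kernel.shape[0]
    pad = k // 2
    padded = np.pad(x, ((pad, pad), (0, 0)))
    out = np.zeros_like(x)
    for i in range(k):
        out += padded[i:i + x.shape[0]] * kernel[i]
    return out

def cgmlp_branch(x, w_up, kernel, w_down):
    # Local branch: project up, split channels in two halves, and gate one
    # half with a convolved copy of the other (assumed gating form).
    h = np.maximum(x @ w_up, 0.0)                 # ReLU is an assumption
    a, b = np.split(h, 2, axis=-1)
    gate = depthwise_conv1d(layer_norm(b), kernel)
    return (a * gate) @ w_down                    # (T, d_model)

def branchformer_layer(x, p):
    # Both branches read the same normalized input and run in parallel.
    normed = layer_norm(x)
    g = attention_branch(normed, p["w_q"], p["w_k"], p["w_v"])
    l = cgmlp_branch(normed, p["w_up"], p["kernel"], p["w_down"])
    # Merge by concatenation + linear projection (assumed), then residual.
    merged = np.concatenate([g, l], axis=-1) @ p["w_merge"]
    return x + merged

def init_params(d_model, d_hidden, kernel_size, rng):
    # Hypothetical helper: small random weights, d_hidden must be even.
    s = 0.1
    return {
        "w_q": s * rng.standard_normal((d_model, d_model)),
        "w_k": s * rng.standard_normal((d_model, d_model)),
        "w_v": s * rng.standard_normal((d_model, d_model)),
        "w_up": s * rng.standard_normal((d_model, d_hidden)),
        "kernel": s * rng.standard_normal((kernel_size, d_hidden // 2)),
        "w_down": s * rng.standard_normal((d_hidden // 2, d_model)),
        "w_merge": s * rng.standard_normal((2 * d_model, d_model)),
    }

rng = np.random.default_rng(0)
params = init_params(d_model=8, d_hidden=16, kernel_size=3, rng=rng)
frames = rng.standard_normal((6, 8))              # 6 frames, 8 features each
print(branchformer_layer(frames, params).shape)   # (6, 8)
```

In this sketch the local branch at frame t only sees frames within the convolution window, while the attention branch mixes information across the whole sequence, which is the local/global split the abstract describes.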

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing

Methods: Attention Is All You Need · Linear Layer · Softmax · Multi-Head Attention · Residual Connection · Dense Connections · Position-Wise Feed-Forward Layer · Dropout · Label Smoothing · Absolute Position Encodings