Branchformer: Parallel MLP-Attention Architectures to Capture Local and Global Context for Speech Recognition and Understanding
Yifan Peng, Siddharth Dalmia, Ian Lane, Shinji Watanabe

TL;DR
Branchformer introduces an encoder with two parallel branches per layer: self-attention for global dependencies and a convolutional gating MLP (cgMLP) for local ones. On speech recognition and spoken language understanding benchmarks it outperforms Transformer and cgMLP, and matches or outperforms Conformer.
Contribution
Proposes a parallel-branch encoder architecture for end-to-end speech processing, presented as a more flexible, interpretable, and customizable alternative to Conformer.
Findings
Outperforms Transformer and cgMLP on speech benchmarks.
Matches or exceeds state-of-the-art Conformer results.
Enables variable inference complexity with reduced computation strategies.
Abstract
Conformer has proven to be effective in many speech processing tasks. It combines the benefits of extracting local dependencies using convolutions and global dependencies using self-attention. Inspired by this, we propose a more flexible, interpretable and customizable encoder alternative, Branchformer, with parallel branches for modeling various ranged dependencies in end-to-end speech processing. In each encoder layer, one branch employs self-attention or its variant to capture long-range dependencies, while the other branch utilizes an MLP module with convolutional gating (cgMLP) to extract local relationships. We conduct experiments on several speech recognition and spoken language understanding benchmarks. Results show that our model outperforms both Transformer and cgMLP. It also matches or outperforms state-of-the-art results achieved by Conformer. Furthermore, we show…
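The two-branch layer described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the single-head attention, the layer sizes, the tanh GELU approximation, the parameter names (`wq`, `w_up`, `conv`, and so on), and in particular the merge step (a plain average of the two branch outputs) are all assumptions made for the example, since the abstract does not specify them.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each frame over its feature dimension (no learned scale/shift).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def self_attention_branch(x, p):
    # Global branch: every frame attends to every other frame. x: (T, D)
    q, k, v = x @ p["wq"], x @ p["wk"], x @ p["wv"]
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v @ p["wo"]

def depthwise_conv1d(x, kernel):
    # Per-channel convolution over time with zero "same" padding.
    # x: (T, C), kernel: (K, C) with K odd.
    k_size = kernel.shape[0]
    pad = k_size // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    return np.stack([(xp[t:t + k_size] * kernel).sum(axis=0)
                     for t in range(x.shape[0])])

def cgmlp_branch(x, p):
    # Local branch: MLP whose gate is computed by a convolution over time.
    h = gelu(x @ p["w_up"])               # (T, 2H)
    a, b = np.split(h, 2, axis=-1)        # two halves, each (T, H)
    gate = depthwise_conv1d(layer_norm(b), p["conv"])
    return (a * gate) @ p["w_down"]       # back to (T, D)

def branchformer_layer(x, p):
    # Both branches read the same normalized input and run independently.
    xn = layer_norm(x)
    global_out = self_attention_branch(xn, p)
    local_out = cgmlp_branch(xn, p)
    merged = 0.5 * (global_out + local_out)  # ASSUMPTION: simple average merge
    return x + merged                        # residual connection

def init_params(d_model=8, hidden=16, kernel_size=3, seed=0):
    # Hypothetical small sizes chosen only to keep the example fast.
    rng = np.random.default_rng(seed)
    w = lambda *shape: rng.normal(scale=0.1, size=shape)
    return {"wq": w(d_model, d_model), "wk": w(d_model, d_model),
            "wv": w(d_model, d_model), "wo": w(d_model, d_model),
            "w_up": w(d_model, 2 * hidden), "conv": w(kernel_size, hidden),
            "w_down": w(hidden, d_model)}

params = init_params()
frames = np.random.default_rng(1).normal(size=(5, 8))  # 5 frames, 8 features
out = branchformer_layer(frames, params)
```

With a kernel of size 3, the cgMLP branch output at a given frame depends only on that frame and its immediate neighbours, whereas the attention branch output at any frame depends on the whole sequence. That difference is the local/global split the abstract describes, and because the branches are computed separately, their outputs can be inspected individually.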
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing
Methods: Attention Is All You Need · Linear Layer · Softmax · Multi-Head Attention · Residual Connection · Dense Connections · Position-Wise Feed-Forward Layer · Dropout · Label Smoothing · Absolute Position Encodings
