InterFormer: Interactive Local and Global Features Fusion for Automatic Speech Recognition

Zhi-Hao Lai; Tian-Hao Zhang; Qi Liu; Xinyuan Qian; Li-Fang Wei; Song-Lu Chen; Feng Chen; Xu-Cheng Yin

arXiv:2305.16342·cs.CL·May 30, 2023·1 cites

Open Access

TL;DR

InterFormer introduces a novel parallel architecture combining convolution and transformer blocks with interaction and fusion modules to enhance local and global feature integration for improved speech recognition accuracy.

Contribution

The paper proposes InterFormer, a model that fuses local and global features interactively using a bidirectional feature interaction module (BFIM) and a selective fusion module (SFM), improving ASR performance.

Findings

01. Outperforms existing Transformer and Conformer models on public datasets.

02. Demonstrates effective local-global feature interaction.

03. Achieves superior recognition accuracy.

Abstract

Both local and global features are essential for automatic speech recognition (ASR). Many recent methods have verified that simply combining local and global features can further improve ASR performance. However, these methods pay less attention to the interaction between local and global features, and their serial architectures are too rigid to reflect local and global relationships. To address these issues, this paper proposes InterFormer, which fuses local and global features interactively to learn a better representation for ASR. Specifically, we combine the convolution block with the transformer block in a parallel design. In addition, we propose a bidirectional feature interaction module (BFIM) and a selective fusion module (SFM) to implement the interaction and fusion of local and global features, respectively. Extensive experiments on public ASR datasets demonstrate the effectiveness of our…
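
The parallel design described in the abstract can be illustrated with a toy sketch. This is not the paper's implementation: the abstract does not specify how BFIM or SFM work internally, so `interact` (cross-branch sigmoid gating) and `selective_fuse` (per-feature softmax weighting) are hypothetical stand-ins, and `local_branch` and `global_branch` are simplified, parameter-free substitutes for the convolution and transformer blocks. All function names are assumptions made for illustration.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def local_branch(frames, kernel=(0.25, 0.5, 0.25)):
    # Toy stand-in for the convolution block: a fixed 1-D filter
    # over the time axis with zero padding (captures local context).
    half = len(kernel) // 2
    T, D = len(frames), len(frames[0])
    out = []
    for t in range(T):
        row = [0.0] * D
        for k, w in enumerate(kernel):
            s = t + k - half
            if 0 <= s < T:
                for d in range(D):
                    row[d] += w * frames[s][d]
        out.append(row)
    return out

def global_branch(frames):
    # Toy stand-in for the transformer block: single-head self-attention
    # with no learned projections (every frame attends to all frames).
    D = len(frames[0])
    scale = math.sqrt(D)
    out = []
    for q in frames:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in frames]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, frames))
                    for d in range(D)])
    return out

def interact(local, glob):
    # ASSUMPTION: a hypothetical bidirectional interaction in which each
    # branch is gated by the other branch and added back residually.
    new_local = [[l + sigmoid(g) * l for l, g in zip(lr, gr)]
                 for lr, gr in zip(local, glob)]
    new_glob = [[g + sigmoid(l) * g for l, g in zip(lr, gr)]
                for lr, gr in zip(local, glob)]
    return new_local, new_glob

def selective_fuse(local, glob):
    # ASSUMPTION: a hypothetical selective fusion that weights the two
    # branches per feature via a softmax over time-averaged activations.
    T, D = len(local), len(local[0])
    mean_l = [sum(r[d] for r in local) / T for d in range(D)]
    mean_g = [sum(r[d] for r in glob) / T for d in range(D)]
    weights = [softmax([mean_l[d], mean_g[d]]) for d in range(D)]
    return [[weights[d][0] * local[t][d] + weights[d][1] * glob[t][d]
             for d in range(D)] for t in range(T)]

def toy_block(frames):
    # Both branches read the same input in parallel, exchange
    # information, and are then fused into one representation.
    local = local_branch(frames)
    glob = global_branch(frames)
    local, glob = interact(local, glob)
    return selective_fuse(local, glob)

# Four time steps with two features each (made-up values).
frames = [[0.1, 0.2], [0.4, -0.3], [0.0, 0.5], [0.2, 0.1]]
fused = toy_block(frames)
```

The point of the sketch is the data flow: both branches see the same frames, interaction happens before fusion, and the fused output keeps the input's shape so that such blocks could be stacked.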

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing

Methods: Multi-Head Attention · Attention Is All You Need · Softmax · Layer Normalization · Byte Pair Encoding · Dropout · Linear Layer · Label Smoothing · Adam · Residual Connection