DEFORMER: Coupling Deformed Localized Patterns with Global Context for Robust End-to-end Speech Recognition

Jiamin Xie; John H.L. Hansen

arXiv:2207.01732 · eess.AS · June 19, 2025 · Interspeech

TL;DR

This paper introduces Deformer, a speech recognition model that replaces the depthwise CNNs of the Conformer with deformable kernels. The deformable kernels capture asymmetric local features and reduce word error rate (WER) by 5.6% relative without a language model.

Contribution

The study replaces the depthwise CNNs in the Conformer with deformable kernels, so that the learned local features couple better with global attention, and improves performance over the Conformer baseline.

Findings

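01. Deformer achieves a 5.6% relative WER reduction over the Conformer baseline without a language model (LM).

02. Visualizations of local receptive fields and global attention maps show increased feature associations at the utterance level.

03. Statistical analysis of the learned kernel offsets shows how the information in features changes with network depth.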
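
The relative WER figure in the first finding is a ratio against the baseline, not a difference in absolute WER points. A minimal sketch of that arithmetic follows; the function name and the two WER values are hypothetical and chosen only for illustration, they are not results reported in the paper.

```python
def relative_wer_reduction(baseline_wer, new_wer):
    """Return the relative WER reduction in percent (positive means improvement)."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Hypothetical WERs (in percent) for illustration only; not from the paper.
baseline = 10.0
deformed = 9.44
print(round(relative_wer_reduction(baseline, deformed), 1))
```

With these made-up numbers, an absolute drop of 0.56 WER points corresponds to a relative reduction of about 5.6%.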

Abstract

Convolutional neural networks (CNNs) have greatly improved speech recognition performance by exploiting localized time-frequency patterns. However, the conventional CNN operation assumes that these patterns appear in symmetric and rigid kernels. This motivates the question: What about asymmetric kernels? In this study, we illustrate that adaptive views can discover local features which couple better with attention than fixed views of the input. We replace depthwise CNNs in the Conformer architecture with a deformable counterpart, dubbed the "Deformer". By analyzing our best-performing model, we visualize both local receptive fields and global attention maps learned by the Deformer and show increased feature associations on the utterance level. The statistical analysis of learned kernel offsets provides insight into how the information in features changes with network depth. Finally,…
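
To make the contrast between fixed and adaptive views concrete, here is a minimal sketch of a single-channel deformable convolution along the time axis. It is an assumption-laden illustration, not the authors' implementation: the function names are hypothetical, the offsets are passed in directly (in a deformable layer they would normally be predicted from the input by a separate learned layer), and linear interpolation with zero padding is assumed for sampling at fractional positions.

```python
import math


def sample_linear(x, pos):
    """Linearly interpolate sequence x at fractional position pos (zero outside)."""
    lo = math.floor(pos)
    frac = pos - lo

    def at(i):
        return x[i] if 0 <= i < len(x) else 0.0

    return (1.0 - frac) * at(lo) + frac * at(lo + 1)


def deformable_conv1d(x, weights, offsets):
    """Single-channel deformable convolution over time (illustrative sketch).

    x:       list of T floats (one feature channel over T frames)
    weights: list of K floats, K odd (the kernel taps)
    offsets: T lists of K floats; offsets[t][k] shifts tap k at frame t.
             All-zero offsets reduce this to an ordinary fixed-kernel convolution.
    """
    K = len(weights)
    half = K // 2
    out = []
    for t in range(len(x)):
        acc = 0.0
        for k in range(K):
            # Regular grid position plus a per-frame, per-tap displacement.
            pos = t + (k - half) + offsets[t][k]
            acc += weights[k] * sample_linear(x, pos)
        out.append(acc)
    return out


x = [1.0, 2.0, 3.0, 4.0]
w = [0.0, 1.0, 0.0]
fixed_view = [[0.0] * 3 for _ in x]    # rigid, symmetric sampling grid
shifted_view = [[0.5] * 3 for _ in x]  # every tap displaced half a frame later
print(deformable_conv1d(x, w, fixed_view))
print(deformable_conv1d(x, w, shifted_view))
```

Because each frame can carry its own offsets, the sampling grid no longer has to be symmetric around the center tap, which is the sense in which the kernel is "deformed".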

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing