Echotune: A Modular Extractor Leveraging the Variable-Length Nature of Speech in ASR Tasks
Sizhou Chen, Songyang Gao, Sen Fang

TL;DR
This paper introduces Echo-MSA, a variable-length attention module for ASR that improves speech feature extraction across diverse durations, improving word error rate (WER) without sacrificing model stability.
Contribution
We propose Echo-MSA, a novel variable-length attention mechanism that effectively captures speech features across different granularities, addressing fixed-length attention limitations in ASR models.
Findings
Echo-MSA improves word error rate (WER) performance.
Integration of Echo-MSA maintains model stability.
Variable-length attention captures diverse speech features.
Abstract
The Transformer architecture has proven to be highly effective for Automatic Speech Recognition (ASR) tasks, becoming a foundational component for a plethora of research in the domain. Historically, many approaches have leaned on fixed-length attention windows, which become problematic for speech samples that vary in duration and complexity, leading to data over-smoothing and neglect of essential long-term connectivity. Addressing this limitation, we introduce Echo-MSA, a nimble module equipped with a variable-length attention mechanism that accommodates a range of speech sample complexities and durations. This module offers the flexibility to extract speech features across various granularities, spanning from frames and phonemes to words and discourse. The proposed design captures the variable-length nature of speech and addresses the limitations of fixed-length attention. Our…
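The contrast the abstract draws between a single fixed-length window and attention over several span lengths can be sketched as follows. This is a minimal illustration only, not the Echo-MSA module itself: the function names, the window sizes (1, 4, 16), and the choice to average the per-window outputs are all assumptions made for the example, and learned query/key/value projections are omitted.

```python
import numpy as np

def local_window_mask(seq_len, window):
    """Boolean mask: position i may attend to j iff |i - j| <= window."""
    idx = np.arange(seq_len)
    return np.abs(idx[:, None] - idx[None, :]) <= window

def masked_attention(q, k, v, mask):
    """Scaled dot-product attention restricted to positions allowed by mask."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    scores = np.where(mask, scores, -np.inf)   # block disallowed positions
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def multi_window_attention(x, windows=(1, 4, 16)):
    """Hypothetical combination: average self-attention over several window sizes.

    Small windows roughly correspond to frame-level context, larger ones to
    longer spans. The averaging rule is an assumption, not the paper's design.
    """
    outputs = [
        masked_attention(x, x, x, local_window_mask(len(x), w)) for w in windows
    ]
    return np.mean(outputs, axis=0)

# Usage: 10 frames of 8-dimensional features.
frames = np.random.default_rng(0).normal(size=(10, 8))
mixed = multi_window_attention(frames)
```

A fixed-length scheme corresponds to passing a single value in `windows`; supplying several values lets each frame aggregate context at more than one span length.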
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
Methods: Multi-Head Attention · Attention Is All You Need · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Linear Layer · Residual Connection · Adam · Byte Pair Encoding · Softmax · Layer Normalization
