Echotune: A Modular Extractor Leveraging the Variable-Length Nature of Speech in ASR Tasks

Sizhou Chen; Songyang Gao; Sen Fang

arXiv:2309.07765·cs.SD·April 9, 2024

TL;DR

This paper introduces Echo-MSA, a variable-length attention module for ASR that improves speech feature extraction across diverse durations, enhancing WER performance without sacrificing model stability.

Contribution

We propose Echo-MSA, a novel variable-length attention mechanism that effectively captures speech features across different granularities, addressing fixed-length attention limitations in ASR models.

Findings

01. Echo-MSA improves word error rate (WER) performance.

02. Integration of Echo-MSA maintains model stability.

03. Variable-length attention captures diverse speech features.

Abstract

The Transformer architecture has proven highly effective for Automatic Speech Recognition (ASR) tasks and has become a foundational component for much research in the domain. Historically, many approaches have relied on fixed-length attention windows, which become problematic for speech samples that vary in duration and complexity, leading to over-smoothing of the data and neglect of essential long-term connectivity. To address this limitation, we introduce Echo-MSA, a nimble module equipped with a variable-length attention mechanism that accommodates a range of speech sample complexities and durations. This module offers the flexibility to extract speech features at various granularities, from frames and phonemes to words and discourse. The proposed design captures the variable-length nature of speech and addresses the limitations of fixed-length attention. Our…
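To make the idea of attending over several span lengths concrete, here is a minimal sketch in plain Python. It is not the authors' Echo-MSA implementation: the function names, the choice of window half-widths, the use of unprojected features as queries, keys, and values, and the simple averaging of branches are all assumptions made for illustration. The abstract does not specify how the module selects or fuses window sizes.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of finite floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def windowed_attention(x, half_width):
    """Self-attention in which frame t sees only frames within half_width of t.

    x is a list of T feature vectors (lists of floats). Queries, keys, and
    values are x itself (no learned projections), an assumption made to keep
    the sketch short.
    """
    T, d = len(x), len(x[0])
    out = []
    for t in range(T):
        lo, hi = max(0, t - half_width), min(T, t + half_width + 1)
        # Scaled dot-product scores against the frames inside the window.
        scores = [
            sum(a * b for a, b in zip(x[t], x[s])) / math.sqrt(d)
            for s in range(lo, hi)
        ]
        weights = softmax(scores)
        # Weighted sum of the windowed frames, one output dimension at a time.
        out.append(
            [sum(w * x[lo + i][j] for i, w in enumerate(weights)) for j in range(d)]
        )
    return out


def multi_window_attention(x, half_widths=(1, 4, 16)):
    """Run several windowed attentions and average them (hypothetical fusion).

    Small windows roughly correspond to frame- or phoneme-scale context,
    larger ones to word- or utterance-scale context.
    """
    branches = [windowed_attention(x, w) for w in half_widths]
    T, d = len(x), len(x[0])
    return [
        [sum(b[t][j] for b in branches) / len(branches) for j in range(d)]
        for t in range(T)
    ]
```

Calling `multi_window_attention(frames)` on a list of per-frame feature vectors returns one fused vector per frame. A learned gate or per-head window assignment could replace the plain average, but that choice is outside what the abstract states.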

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing

Methods: Multi-Head Attention · Attention Is All You Need · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Linear Layer · Residual Connection · Adam · Byte Pair Encoding · Softmax · Layer Normalization