SpeechFormer: A Hierarchical Efficient Framework Incorporating the Characteristics of Speech
Weidong Chen, Xiaofen Xing, Xiangmin Xu, Jianxin Pang, Lan Du

TL;DR
SpeechFormer is a hierarchical Transformer framework designed for speech processing that leverages speech structure to improve efficiency and performance in emotion recognition and neurocognitive disorder detection.
Contribution
It introduces a novel hierarchical structure considering speech characteristics, reducing computational cost while maintaining or improving accuracy.
Findings
Outperforms the standard Transformer on emotion recognition and neurocognitive disorder detection
Reduces computational cost significantly
Achieves comparable results to state-of-the-art methods
Abstract
Transformers have obtained promising results in the field of cognitive speech signal processing, which is of interest in applications ranging from emotion to neurocognitive disorder analysis. However, most works treat the speech signal as a whole, neglecting the pronunciation structure that is unique to speech and reflects the cognitive process. Meanwhile, the Transformer carries a heavy computational burden due to its full attention operation. In this paper, a hierarchical efficient framework called SpeechFormer, which considers the structural characteristics of speech, is proposed and can serve as a general-purpose backbone for cognitive speech signal processing. The proposed SpeechFormer consists of frame, phoneme, word and utterance stages in succession, each performing a neighboring attention according to the structural pattern of speech with high computational efficiency.…
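To make the idea of "neighboring attention" concrete, here is a minimal NumPy sketch of self-attention restricted to a local window. It is an illustration under stated assumptions, not the authors' implementation: the function name `neighboring_attention`, the window sizes in `HYPOTHETICAL_WINDOWS`, and the absence of learned projections are all choices made for this example. It also builds the full score matrix and masks it for readability, so it shows the attention pattern rather than the computational savings a real windowed implementation would give.

```python
import numpy as np

def neighboring_attention(x, window):
    """Single-head self-attention restricted to a local window.

    x: array of shape (T, d). Position i attends only to positions j
    with |i - j| <= window // 2. For brevity, x is used directly as
    query, key and value (no learned projections).
    Returns (output of shape (T, d), attention weights of shape (T, T)).
    """
    T, d = x.shape
    scores = x @ x.T / np.sqrt(d)
    idx = np.arange(T)
    # True where the key position lies inside the query's neighborhood.
    mask = np.abs(idx[:, None] - idx[None, :]) <= window // 2
    scores = np.where(mask, scores, -np.inf)
    # Numerically stable softmax over each row; masked entries become 0.
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ x, weights

# Hypothetical window sizes (in tokens) for the first three stages;
# these values are placeholders, not taken from the paper.
HYPOTHETICAL_WINDOWS = {"frame": 3, "phoneme": 5, "word": 9}

rng = np.random.default_rng(0)
x = rng.standard_normal((20, 8))
for stage, window in HYPOTHETICAL_WINDOWS.items():
    x, weights = neighboring_attention(x, window)
    # Number of query-key pairs actually attended, versus T*T for full attention.
    print(stage, int((weights > 0).sum()), "of", weights.size)
```

The printed counts show how few query-key pairs a narrow window touches compared with full attention over the same sequence, which is the source of the efficiency the abstract refers to.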
Taxonomy
Topics: Speech Recognition and Synthesis · Emotion and Mood Recognition · EEG and Brain-Computer Interfaces
