Frequency-Directional Attention Model for Multilingual Automatic Speech Recognition
Akihiro Dobashi, Chee Siang Leow, Hiromitsu Nishizaki

TL;DR
This paper introduces a frequency-directional attention model, implemented as a Transformer-encoder, that transforms input Mel filter bank features according to language-specific frequency characteristics and thereby improves the accuracy of multilingual end-to-end speech recognition.
Contribution
It presents a novel frequency-directional attention mechanism for multilingual ASR, enhancing feature transformation and recognition accuracy across languages.
Findings
Achieved, on average across six languages, 5.3 points higher accuracy than the ASR model for each language.
Demonstrated the effectiveness of frequency-directional attention in feature transformation.
Visualized attention weights to show language-specific frequency considerations.
Abstract
This paper proposes a model that transforms speech features with a frequency-directional attention model for End-to-End (E2E) automatic speech recognition (ASR). The idea rests on the hypothesis that the phoneme system of each language has different frequency-band characteristics when its phonemes are uttered. By transforming the input Mel filter bank features with an attention model that characterizes the frequency direction, a feature transformation suitable for ASR in each language can be expected. This paper introduces a Transformer-encoder as the frequency-directional attention model. We evaluated the proposed method on a multilingual E2E ASR system for six different languages and found that introducing the frequency-directional attention mechanism achieved, on average, 5.3 points higher accuracy than the ASR model for each language.…
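To make the idea of attention along the frequency direction concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes that each Mel filter bank bin is treated as one token, that the token's embedding is its values over a fixed-length window of frames, and that a single attention head with a residual connection is used. The function name `frequency_attention`, the projection matrices, and the sizes (50 frames, 40 Mel bins) are all hypothetical choices for illustration; the paper's Transformer-encoder may differ in every one of these details.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def frequency_attention(feats, w_q, w_k, w_v):
    """Single-head self-attention across Mel bins (illustrative sketch).

    feats: (T, F) array of Mel filter bank features, T frames by F bins.
    w_q, w_k: (T, d) projections; w_v: (T, T) projection (all hypothetical).
    Returns the transformed features (T, F) and the attention weights (F, F).
    """
    tokens = feats.T                             # (F, T): one token per Mel bin
    q = tokens @ w_q                             # (F, d) queries
    k = tokens @ w_k                             # (F, d) keys
    v = tokens @ w_v                             # (F, T) values
    scores = q @ k.T / np.sqrt(q.shape[-1])      # (F, F) bin-to-bin similarity
    weights = softmax(scores, axis=-1)           # each row sums to 1
    out = tokens + weights @ v                   # residual connection
    return out.T, weights                        # back to (T, F)


# Toy input: random values stand in for real Mel filter bank features.
rng = np.random.default_rng(0)
T, F, d = 50, 40, 16
feats = rng.standard_normal((T, F))
w_q = rng.standard_normal((T, d)) * 0.1
w_k = rng.standard_normal((T, d)) * 0.1
w_v = rng.standard_normal((T, T)) * 0.1

transformed, weights = frequency_attention(feats, w_q, w_k, w_v)
print(transformed.shape, weights.shape)
```

In this sketch the `weights` matrix records how strongly each Mel bin attends to every other bin, which is the kind of quantity one could plot to inspect which frequency bands a trained model emphasizes. Here the projections are random, so the weights carry no linguistic meaning.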
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
