Multi-modal Automated Speech Scoring using Attention Fusion
Manraj Singh Grover, Yaman Kumar, Sumit Sarin, Payman Vafaee, Mika Hama, Rajiv Ratn Shah

TL;DR
This paper introduces a multi-modal neural approach using attention fusion to improve automated scoring of non-native English speech by integrating acoustic and lexical cues, demonstrating significant performance gains.
Contribution
The study presents a novel end-to-end neural model with attention fusion for multi-modal speech scoring, combining acoustic and lexical features for enhanced accuracy.
Findings
Attention fusion improves scoring accuracy.
Combined acoustic and lexical cues outperform single modality models.
The paper analyzes the model both qualitatively and quantitatively.
Abstract
In this study, we propose a novel multi-modal end-to-end neural approach for automated assessment of non-native English speakers' spontaneous speech using attention fusion. The pipeline employs Bi-directional Recurrent Convolutional Neural Networks and Bi-directional Long Short-Term Memory Neural Networks to encode acoustic and lexical cues from spectrograms and transcriptions, respectively. Attention fusion is performed on these learned predictive features to learn complex interactions between different modalities before final scoring. We compare our model with strong baselines and find combined attention to both lexical and acoustic cues significantly improves the overall performance of the system. Further, we present a qualitative and quantitative analysis of our model.
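The fusion step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a simple dot-product attention over two fixed-length feature vectors, and the names `attention_fuse`, `acoustic`, `lexical`, and `query` are hypothetical stand-ins for the encoder outputs and learned parameters. In the paper, the features would come from the acoustic and lexical encoders and the attention parameters would be trained end to end.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_fuse(modality_features, query):
    """Fuse same-length feature vectors with dot-product attention.

    modality_features: one vector per modality (hypothetical stand-ins
    for the acoustic and lexical encoder outputs).
    query: vector of the same length, standing in for learned parameters.
    Returns the fused vector and the per-modality attention weights.
    """
    # One relevance score per modality: dot product with the query.
    scores = [sum(f_d * q_d for f_d, q_d in zip(f, query))
              for f in modality_features]
    weights = softmax(scores)
    # Fused representation: attention-weighted sum of the modality vectors.
    dim = len(query)
    fused = [sum(w * f[d] for w, f in zip(weights, modality_features))
             for d in range(dim)]
    return fused, weights


# Toy inputs (assumed values, for illustration only).
acoustic = [0.2, 0.9, -0.1]
lexical = [0.7, 0.1, 0.4]
query = [1.0, 0.0, 1.0]

fused, weights = attention_fuse([acoustic, lexical], query)
```

With these toy values the lexical vector aligns better with the query, so it receives the larger attention weight. A scoring layer (for example, a small feed-forward regressor) would then map `fused` to a proficiency score.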
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and dialogue systems · Speech and Audio Processing
