Multi-modal Automated Speech Scoring using Attention Fusion

Manraj Singh Grover; Yaman Kumar; Sumit Sarin; Payman Vafaee; Mika Hama; Rajiv Ratn Shah

arXiv:2005.08182·cs.CL·November 30, 2021·6 cites

TL;DR

This paper introduces a multi-modal neural approach that uses attention fusion to score non-native English speech automatically. It integrates acoustic and lexical cues and reports significant gains over strong baselines.

Contribution

The study presents a novel end-to-end neural model with attention fusion for multi-modal speech scoring, combining acoustic features encoded from spectrograms with lexical features encoded from transcriptions.

Findings

01

Attention fusion improves scoring accuracy.

02

Combined acoustic and lexical cues outperform single modality models.

03

The authors present both qualitative and quantitative analyses of the model.

Abstract

In this study, we propose a novel multi-modal end-to-end neural approach for automated assessment of non-native English speakers' spontaneous speech using attention fusion. The pipeline employs Bi-directional Recurrent Convolutional Neural Networks and Bi-directional Long Short-Term Memory Neural Networks to encode acoustic and lexical cues from spectrograms and transcriptions, respectively. Attention fusion is performed on these learned predictive features to learn complex interactions between different modalities before final scoring. We compare our model with strong baselines and find combined attention to both lexical and acoustic cues significantly improves the overall performance of the system. Further, we present a qualitative and quantitative analysis of our model.
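The fusion step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify the fusion equations, so the version below assumes a simple softmax attention over two already-encoded modality vectors, followed by a linear scoring layer. The names `attention_fuse`, `score`, `query`, and `out_weights`, as well as all numeric values, are hypothetical and stand in for encoder outputs and learned parameters.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def attention_fuse(modality_features, query):
    """Fuse equal-length modality vectors using softmax attention weights.

    modality_features: dict mapping a modality name to its feature vector
        (assumed to come from the acoustic and lexical encoders).
    query: hypothetical learned vector that scores each modality's relevance.
    Returns the fused vector and the per-modality attention weights.
    """
    names = list(modality_features)
    scores = [dot(modality_features[name], query) for name in names]
    weights = softmax(scores)
    fused = [
        sum(w * modality_features[name][i] for w, name in zip(weights, names))
        for i in range(len(query))
    ]
    return fused, dict(zip(names, weights))


def score(fused, out_weights, bias):
    """Assumed final scoring layer: a single linear projection to a scalar."""
    return dot(fused, out_weights) + bias


# Toy stand-ins for encoder outputs (made-up numbers, not from the paper).
features = {
    "acoustic": [0.2, -0.1, 0.4],
    "lexical": [0.5, 0.3, -0.2],
}
query = [1.0, 0.5, -0.5]

fused, weights = attention_fuse(features, query)
prediction = score(fused, out_weights=[0.8, -0.3, 0.6], bias=2.0)
print(weights, prediction)
```

In this sketch the attention weights always sum to one, so the fused vector is a convex combination of the two modality vectors. A trained model would learn `query` and `out_weights` jointly with the encoders rather than fixing them by hand.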

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and dialogue systems · Speech and Audio Processing