Robust Multi-Modal Speech In-Painting: A Sequence-to-Sequence Approach

Mahsa Kadkhodaei Elyaderani and Shahram Shirani

arXiv:2406.00901·cs.MM·June 4, 2024

TL;DR

This paper presents a robust multi-modal speech in-painting model using a sequence-to-sequence architecture that leverages audio-visual cues and multi-task learning to improve reconstruction quality and intelligibility in challenging environments.

Contribution

The study introduces a novel multi-modal seq2seq model for speech in-painting that incorporates audio-visual (AV) features and multi-task learning, and it outperforms existing transformer-based methods.

Findings

1. Outperforms state-of-the-art transformer models by 38.8% in speech quality.

2. Improves speech intelligibility by 7.14%.

3. Demonstrates robustness across various acoustic and visual distortions.

Abstract

The process of reconstructing missing parts of speech audio from context is called speech in-painting. Human perception of speech is inherently multi-modal, involving both audio and visual (AV) cues. In this paper, we introduce and study a sequence-to-sequence (seq2seq) speech in-painting model that incorporates AV features. Our approach extends AV speech in-painting techniques to scenarios where both audio and visual data may be jointly corrupted. To achieve this, we employ a multi-modal training paradigm that boosts the robustness of our model across various conditions involving acoustic and visual distortions. This makes our distortion-aware model a plausible solution for real-world challenging environments. We compare our method with existing transformer-based and recurrent neural network-based models, which attempt to reconstruct missing speech gaps ranging from a few milliseconds…
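
To make the in-painting task in the abstract concrete, the following is a minimal toy sketch of the problem setup only. It is not the authors' seq2seq model. It assumes a spectrogram stored as a list of time frames, and the helper names `mask_gap` and `interpolate_gap` are hypothetical. A gap of frames is zeroed out and then filled by naive linear interpolation between the nearest intact frames, a non-learned baseline used here purely for illustration.

```python
from typing import List, Tuple

Frame = List[float]  # one time frame: magnitudes per frequency bin (toy assumption)


def mask_gap(frames: List[Frame], start: int, length: int) -> Tuple[List[Frame], List[int]]:
    """Zero out frames[start:start+length]; return the masked copy and a 0/1 mask."""
    n_bins = len(frames[0])
    masked = [list(f) for f in frames]  # copy so the input stays untouched
    mask = [1] * len(frames)            # 1 = frame present, 0 = frame missing
    for t in range(start, min(start + length, len(frames))):
        masked[t] = [0.0] * n_bins
        mask[t] = 0
    return masked, mask


def interpolate_gap(masked: List[Frame], mask: List[int]) -> List[Frame]:
    """Naive baseline: fill each missing frame by linear interpolation
    between the nearest known frames on either side (per frequency bin)."""
    out = [list(f) for f in masked]
    known = [t for t, m in enumerate(mask) if m == 1]
    for t, m in enumerate(mask):
        if m == 1:
            continue
        left = max((k for k in known if k < t), default=None)
        right = min((k for k in known if k > t), default=None)
        if left is None and right is None:
            continue  # nothing to interpolate from
        if left is None:
            out[t] = list(masked[right])  # gap at the start: copy next known frame
        elif right is None:
            out[t] = list(masked[left])   # gap at the end: copy last known frame
        else:
            w = (t - left) / (right - left)
            out[t] = [(1 - w) * a + w * b for a, b in zip(masked[left], masked[right])]
    return out


# Toy spectrogram: 6 frames, 2 frequency bins, values grow linearly in time.
frames = [[float(t), 2.0 * t] for t in range(6)]
masked, mask = mask_gap(frames, start=2, length=2)
filled = interpolate_gap(masked, mask)
```

A learned in-painting model such as the one the paper studies would replace `interpolate_gap` with a network that predicts the missing frames from the surrounding audio context and, in the audio-visual case, from visual features as well.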

Taxonomy

Topics: Phonetics and Phonology Research · Speech and dialogue systems

Methods: Sigmoid Activation · Tanh Activation · Long Short-Term Memory · Sequence to Sequence