Robust Multi-Modal Speech In-Painting: A Sequence-to-Sequence Approach
Mahsa Kadkhodaei Elyaderani and Shahram Shirani

TL;DR
This paper presents a robust multi-modal speech in-painting model using a sequence-to-sequence architecture that leverages audio-visual cues and multi-task learning to improve reconstruction quality and intelligibility in challenging environments.
Contribution
The study introduces a novel multi-modal seq2seq model for speech in-painting that incorporates AV features and multi-task learning, outperforming existing transformer-based methods.
Findings
Outperforms state-of-the-art transformer models by 38.8% in speech quality.
Improves speech intelligibility by 7.14%.
Demonstrates robustness across various acoustic and visual distortions.
Abstract
The process of reconstructing missing parts of speech audio from context is called speech in-painting. Human perception of speech is inherently multi-modal, involving both audio and visual (AV) cues. In this paper, we introduce and study a sequence-to-sequence (seq2seq) speech in-painting model that incorporates AV features. Our approach extends AV speech in-painting techniques to scenarios where both audio and visual data may be jointly corrupted. To achieve this, we employ a multi-modal training paradigm that boosts the robustness of our model across various conditions involving acoustic and visual distortions. This makes our distortion-aware model a plausible solution for real-world challenging environments. We compare our method with existing transformer-based and recurrent neural network-based models, which attempt to reconstruct missing speech gaps ranging from a few milliseconds…
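As a rough illustration of the in-painting setup described in the abstract, the sketch below blanks out a contiguous span of frames in a toy spectrogram-like sequence and returns a mask marking which frames survive. It is not the authors' pipeline: the function name `mask_gap`, the zero fill value, the frame and bin counts, and the random toy data are all assumptions made for illustration. An in-painting model would receive the corrupted frames (plus, in the audio-visual case, visual features) and be trained to predict the frames where the mask is 0.

```python
import random


def mask_gap(frames, gap_start, gap_len, fill=0.0):
    """Blank out frames[gap_start : gap_start + gap_len].

    Returns (corrupted, mask), where mask[t] is 1 for frames that are kept
    and 0 for frames inside the gap. The input list is not modified.
    """
    if gap_start < 0 or gap_len < 0 or gap_start + gap_len > len(frames):
        raise ValueError("gap must lie inside the sequence")
    corrupted = [list(frame) for frame in frames]  # copy each frame
    mask = [1] * len(frames)
    for t in range(gap_start, gap_start + gap_len):
        corrupted[t] = [fill] * len(frames[t])  # hypothetical fill value
        mask[t] = 0
    return corrupted, mask


# Toy "spectrogram": 10 time frames x 4 frequency bins (illustrative sizes).
rng = random.Random(0)
spec = [[rng.random() for _ in range(4)] for _ in range(10)]

# Remove 4 consecutive frames starting at frame 3.
corrupted, mask = mask_gap(spec, gap_start=3, gap_len=4)
```

The mask is kept separate from the corrupted frames so that a training loss can be restricted to the gap if desired; whether the paper does this is not stated in the text above.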
Taxonomy
Topics: Phonetics and Phonology Research · Speech and dialogue systems
Methods: Sigmoid Activation · Tanh Activation · Long Short-Term Memory · Sequence to Sequence
