Sequence-to-Sequence Multi-Modal Speech In-Painting

Mahsa Kadkhodaei Elyaderani; Shahram Shirani

arXiv:2406.01321·cs.SD·June 4, 2024

Open Access

TL;DR

This paper presents a sequence-to-sequence model that combines visual lip-reading with audio data for speech in-painting. It outperforms an audio-only model and performs comparably to a recent multi-modal speech in-painter.

Contribution

The paper introduces a multi-modal sequence-to-sequence model in which the encoder acts as a lip-reader for facial recordings and the decoder combines the encoder outputs with the distorted audio spectrograms to restore the original speech.

Findings

01 Outperforms an audio-only speech in-painting model

02 Achieves results comparable to a recent multi-modal speech in-painter on speech quality and intelligibility metrics

03 Evaluated on distortions of 300 ms to 1500 ms duration

Abstract

Speech in-painting is the task of regenerating missing audio contents using reliable context information. Despite various recent studies in multi-modal perception of audio in-painting, there is still a need for an effective infusion of visual and auditory information in speech in-painting. In this paper, we introduce a novel sequence-to-sequence model that leverages the visual information to in-paint audio signals via an encoder-decoder architecture. The encoder plays the role of a lip-reader for facial recordings and the decoder takes both encoder outputs as well as the distorted audio spectrograms to restore the original speech. Our model outperforms an audio-only speech in-painting model and has comparable results with a recent multi-modal speech in-painter in terms of speech quality and intelligibility metrics for distortions of 300 ms to 1500 ms duration, which proves the…
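To make the distortion lengths concrete, here is a minimal sketch of how a gap of a given duration could be cut out of a spectrogram before in-painting. The abstract does not say how the distortions were generated, so everything here is an assumption for illustration: the 10 ms hop size, the zero-filled gap, and the hypothetical helper names `gap_frames` and `mask_spectrogram`.

```python
import random


def gap_frames(gap_ms, hop_ms=10.0):
    """Number of spectrogram frames covered by a gap of gap_ms milliseconds.

    hop_ms is an assumed frame hop, not a value taken from the paper.
    """
    return max(1, round(gap_ms / hop_ms))


def mask_spectrogram(spec, gap_ms, hop_ms=10.0, start=None, rng=None):
    """Zero out one contiguous gap in a spectrogram.

    spec is a list of frames; each frame is a list of per-frequency magnitudes.
    Returns (distorted, mask), where mask[t] is True for frames that were removed.
    The input spectrogram is left unmodified.
    """
    n_frames = len(spec)
    width = min(gap_frames(gap_ms, hop_ms), n_frames)
    if start is None:
        # Pick a random gap position that fits entirely inside the signal.
        rng = rng or random.Random()
        start = rng.randrange(0, n_frames - width + 1)
    if start < 0 or start + width > n_frames:
        raise ValueError("gap does not fit inside the spectrogram")
    mask = [start <= t < start + width for t in range(n_frames)]
    distorted = [
        [0.0] * len(frame) if removed else list(frame)
        for frame, removed in zip(spec, mask)
    ]
    return distorted, mask


if __name__ == "__main__":
    # A dummy 2-second spectrogram: 200 frames of 4 frequency bins each.
    spec = [[1.0] * 4 for _ in range(200)]
    distorted, mask = mask_spectrogram(spec, gap_ms=300, start=50)
    print(sum(mask), "frames removed")
```

Under the assumed 10 ms hop, a 300 ms distortion covers 30 frames and a 1500 ms distortion covers 150 frames. The mask is returned alongside the distorted spectrogram because an in-painting model typically needs to know which frames are reliable context and which must be regenerated.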

Taxonomy

Topics: Phonetics and Phonology Research