Improving Lip-synchrony in Direct Audio-Visual Speech-to-Speech Translation

Lucas Goncalves; Prashant Mathur; Xing Niu; Brady Houston; Chandrashekhar Lavania; Srikanth Vishnubhotla; Lijia Sun; Anthony Ferritto

arXiv:2412.16530·cs.SD·December 24, 2024

Open Access

TL;DR

This paper introduces a lip-synchrony loss into AVS2S models, significantly improving lip-movement alignment in speech translation without compromising translation quality, thus enhancing realism in dubbed videos.

Contribution

It presents a novel method that incorporates lip-synchrony constraints into AVS2S models, addressing a previously overlooked aspect of speech translation.

Findings

01

Achieved an average LSE-D score of 10.67, a 9.2% reduction relative to a strong baseline (lower LSE-D indicates better lip-synchrony).

02

Enhanced lip-synchrony without degrading translation naturalness.

03

Effective across four language pairs.

Abstract

Audio-Visual Speech-to-Speech Translation (AVS2S) typically prioritizes improving translation quality and naturalness. However, an equally critical aspect of audio-visual content is lip-synchrony, that is, ensuring that the movements of the lips match the spoken content, which is essential for maintaining realism in dubbed videos. Despite its importance, the inclusion of lip-synchrony constraints in AVS2S models has been largely overlooked. This study addresses this gap by integrating a lip-synchrony loss into the training process of AVS2S models. Our proposed method significantly enhances lip-synchrony in direct audio-visual speech-to-speech translation, achieving an average LSE-D score of 10.67, representing a 9.2% reduction in LSE-D over a strong baseline across four language pairs. Additionally, it maintains the naturalness and high quality of the translated speech when overlaid onto the original video,…
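The headline figures in the abstract can be related with a little arithmetic. The sketch below is illustrative only: the helper names `relative_reduction` and `implied_baseline` are hypothetical, and the baseline value it prints is an estimate back-computed from the two rounded numbers in the abstract (10.67 and 9.2%), not a figure reported by the authors.

```python
def relative_reduction(baseline: float, new: float) -> float:
    """Fractional reduction of `new` relative to `baseline` (positive means lower)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - new) / baseline


def implied_baseline(new: float, reduction: float) -> float:
    """Baseline value implied by a score and its fractional reduction."""
    if not 0 <= reduction < 1:
        raise ValueError("reduction must be in [0, 1)")
    return new / (1.0 - reduction)


# Figures taken from the abstract; both are rounded in the source.
reported_lse_d = 10.67   # average LSE-D of the proposed method
reported_cut = 0.092     # 9.2% reduction in LSE-D over the baseline

# Assumption: the 9.2% is computed as (baseline - new) / baseline.
baseline_estimate = implied_baseline(reported_lse_d, reported_cut)
print(f"implied baseline LSE-D: {baseline_estimate:.2f}")
print(f"check: {relative_reduction(baseline_estimate, reported_lse_d):.1%}")
```

Under that assumption, the implied baseline average works out to roughly 11.75. Because LSE-D is a distance-based error measure, the reduction reads as an improvement in lip-synchrony.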

Taxonomy

Topics: Speech and Audio Processing · Subtitles and Audiovisual Media