Improving Lip-synchrony in Direct Audio-Visual Speech-to-Speech Translation
Lucas Goncalves, Prashant Mathur, Xing Niu, Brady Houston, Chandrashekhar Lavania, Srikanth Vishnubhotla, Lijia Sun, Anthony Ferritto

TL;DR
This paper adds a lip-synchrony loss to the training of audio-visual speech-to-speech translation (AVS2S) models, significantly improving lip-movement alignment without compromising translation quality and thereby making dubbed videos more realistic.
Contribution
It presents a novel method that incorporates lip-synchrony constraints into AVS2S models, addressing a previously overlooked aspect of speech translation.
Findings
Achieved an average LSE-D score of 10.67, a 9.2% reduction in LSE-D relative to a strong baseline (lower LSE-D indicates better lip-synchrony).
Enhanced lip-synchrony without degrading the naturalness or quality of the translated speech.
Effective across four language pairs.
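The 9.2% figure is a relative reduction, and the arithmetic behind such a figure can be sketched as follows. This is an illustrative helper only: the function names `relative_reduction` and `implied_baseline` are hypothetical, and the sketch assumes the percentage is computed against the baseline's average LSE-D.

```python
def relative_reduction(baseline: float, improved: float) -> float:
    """Fractional reduction of a lower-is-better metric relative to the baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - improved) / baseline


def implied_baseline(improved: float, reduction: float) -> float:
    """Baseline value implied by an improved score and a fractional reduction."""
    if not 0 <= reduction < 1:
        raise ValueError("reduction must be in [0, 1)")
    return improved / (1.0 - reduction)


if __name__ == "__main__":
    # Figures quoted in the summary: average LSE-D 10.67, 9.2% reduction.
    print(round(implied_baseline(10.67, 0.092), 2))
```

Under that assumption, the quoted numbers imply a baseline average of about 11.75. This value is derived here for illustration and is not reported in this summary.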
Abstract
Audio-Visual Speech-to-Speech Translation (AVS2S) typically prioritizes improving translation quality and naturalness. However, an equally critical aspect of audio-visual content is lip-synchrony (ensuring that the movements of the lips match the spoken content), which is essential for maintaining realism in dubbed videos. Despite its importance, the inclusion of lip-synchrony constraints in AVS2S models has been largely overlooked. This study addresses this gap by integrating a lip-synchrony loss into the training process of AVS2S models. Our proposed method significantly enhances lip-synchrony in direct audio-visual speech-to-speech translation, achieving an average LSE-D score of 10.67, representing a 9.2% reduction in LSE-D over a strong baseline across four language pairs. Additionally, it maintains the naturalness and high quality of the translated speech when overlaid onto the original video,…
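The abstract does not spell out the form of the lip-synchrony loss. As a minimal sketch only, assume it is a weighted auxiliary term added to the translation objective, and that synchrony is measured as the mean distance between paired per-frame audio and video embeddings (in the spirit of LSE-D, where lower is better). The names `sync_distance`, `total_loss`, and `sync_weight` are hypothetical and do not come from the paper.

```python
import math
from typing import Sequence


def sync_distance(audio_emb: Sequence[Sequence[float]],
                  video_emb: Sequence[Sequence[float]]) -> float:
    """Mean Euclidean distance between paired audio and video frame embeddings."""
    if len(audio_emb) != len(video_emb) or not audio_emb:
        raise ValueError("need the same non-zero number of audio and video frames")
    dists = [math.dist(a, v) for a, v in zip(audio_emb, video_emb)]
    return sum(dists) / len(dists)


def total_loss(translation_loss: float,
               audio_emb: Sequence[Sequence[float]],
               video_emb: Sequence[Sequence[float]],
               sync_weight: float = 0.1) -> float:
    """Translation loss plus a weighted synchrony penalty (assumed form)."""
    return translation_loss + sync_weight * sync_distance(audio_emb, video_emb)


if __name__ == "__main__":
    audio = [[0.0, 0.0], [1.0, 1.0]]
    video = [[3.0, 4.0], [1.0, 1.0]]
    print(total_loss(1.0, audio, video, sync_weight=0.5))
```

In a real training setup the embeddings would come from a pretrained synchronization model and the loss would be computed on tensors so that gradients flow to the translation model. The plain-Python version here only shows how the two terms would be combined.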
Taxonomy
Topics: Speech and Audio Processing · Subtitles and Audiovisual Media
