VocaLiST: An Audio-Visual Synchronisation Model for Lips and Voices
Venkatesh S. Kadandale, Juan F. Montesinos, Gloria Haro

TL;DR
This paper introduces VocaLiST, a transformer-based model for audio-visual lip-voice synchronisation. It outperforms several baseline models on LRS2 and extends to singing voice, where its lip synchronisation features also improve singing voice separation.
Contribution
We propose a novel cross-modal transformer model for lip-voice synchronisation that applies to both speech and singing, and we demonstrate its effectiveness on synchronisation and singing voice separation tasks.
Findings
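Our model outperforms several baseline models on the audio-visual synchronisation task on the LRS2 dataset.
Lip synchronisation features improve singing voice separation.
The model also handles singing voice synchronisation, a more challenging case than speech because of sustained vowel sounds.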
Abstract
In this paper, we address the problem of lip-voice synchronisation in videos containing a human face and voice. Our approach determines whether the lip motion and the voice in a video are synchronised, based on their audio-visual correspondence score. We propose an audio-visual cross-modal transformer-based model that outperforms several baseline models in the audio-visual synchronisation task on the standard lip-reading speech benchmark dataset LRS2. While existing methods focus mainly on lip synchronisation in speech videos, we also consider the special case of the singing voice. The singing voice is a more challenging use case for synchronisation due to sustained vowel sounds. We also investigate the relevance of lip synchronisation models trained on speech datasets in the context of singing voice. Finally, we use the frozen visual features learned by our lip…
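The decision step described in the abstract, labelling a clip as synchronised or not from its audio-visual correspondence score, can be sketched as below. This is a minimal illustration, not the authors' implementation: the function names, the use of a sigmoid to map a raw model output to a score, and the 0.5 threshold are all assumptions made for the example.

```python
import math


def correspondence_score(logit: float) -> float:
    """Map a raw model output (hypothetical logit) to a score in (0, 1).

    Uses a numerically stable sigmoid; the sigmoid itself is an assumption,
    not something stated in the abstract.
    """
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    e = math.exp(logit)
    return e / (1.0 + e)


def is_synchronised(score: float, threshold: float = 0.5) -> bool:
    """Label a clip as in sync when its score reaches the (assumed) threshold."""
    return score >= threshold


# Hypothetical raw outputs for three audio-visual clips.
logits = [2.3, -1.1, 0.0]
decisions = [is_synchronised(correspondence_score(x)) for x in logits]
```

In practice the threshold would be tuned on held-out data rather than fixed at 0.5.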
Taxonomy
Topics: Speech and Audio Processing · Face recognition and analysis · Subtitles and Audiovisual Media
