Synchformer: Efficient Synchronization from Sparse Cues
Vladimir Iashin, Weidi Xie, Esa Rahtu, Andrew Zisserman

TL;DR
Synchformer introduces an efficient audio-visual synchronization model tailored for in-the-wild videos, leveraging contrastive pre-training to achieve state-of-the-art results in both dense and sparse cue scenarios.
Contribution
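The paper presents a novel audio-visual synchronization model whose training decouples feature extraction from synchronization modelling through multi-modal segment-level contrastive pre-training. It also extends synchronization training to AudioSet, investigates evidence attribution techniques for interpretability, and explores audio-visual synchronizability as a new capability.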
Findings
Achieves state-of-the-art performance in synchronization tasks.
Effective on both dense and sparse cues.
Extends synchronization model training to AudioSet, a million-scale 'in-the-wild' dataset.
Abstract
Our objective is audio-visual synchronization with a focus on 'in-the-wild' videos, such as those on YouTube, where synchronization cues can be sparse. Our contributions include a novel audio-visual synchronization model, and training that decouples feature extraction from synchronization modelling through multi-modal segment-level contrastive pre-training. This approach achieves state-of-the-art performance in both dense and sparse settings. We also extend synchronization model training to AudioSet, a million-scale 'in-the-wild' dataset, investigate evidence attribution techniques for interpretability, and explore a new capability for synchronization models: audio-visual synchronizability.
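To make the idea of segment-level contrastive pre-training more concrete, the following is a minimal sketch and not the authors' implementation. It assumes each video has already been split into N temporally aligned segments, that some feature extractors (not shown) produce one vector per audio segment and one per visual segment, and that a symmetric InfoNCE objective is used; the function names, the temperature value, and the toy features are all hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax_row(row):
    """Numerically stable log-softmax over one row of similarity scores."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def segment_contrastive_loss(audio_feats, visual_feats, temperature=0.1):
    """Symmetric InfoNCE loss over N paired segments (illustrative assumption).

    audio_feats[i] and visual_feats[i] are assumed to come from the same
    segment (the positive pair); every other pairing is treated as a negative.
    """
    a = [l2_normalize(f) for f in audio_feats]
    v = [l2_normalize(f) for f in visual_feats]
    n = len(a)

    # sim[i][j]: scaled cosine similarity of audio segment i and visual segment j.
    sim = [
        [sum(p * q for p, q in zip(a[i], v[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    # Audio-to-visual direction: each audio segment should pick its own visual segment.
    a2v = -sum(log_softmax_row(sim[i])[i] for i in range(n)) / n

    # Visual-to-audio direction: same computation on the transposed matrix.
    sim_t = [list(col) for col in zip(*sim)]
    v2a = -sum(log_softmax_row(sim_t[j])[j] for j in range(n)) / n

    return 0.5 * (a2v + v2a)


# Toy features: three segments with three-dimensional embeddings.
audio = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
visual_aligned = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
visual_shuffled = [visual_aligned[1], visual_aligned[2], visual_aligned[0]]

loss_aligned = segment_contrastive_loss(audio, visual_aligned)
loss_shuffled = segment_contrastive_loss(audio, visual_shuffled)
```

In this toy setup the loss is small when matching segments share an embedding and large when the visual segments are shuffled. Under the decoupling described in the abstract, feature extractors pre-trained with an objective of this kind would then feed a separate synchronization module.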
Taxonomy
Topics: Neural Networks and Reservoir Computing · Neural Networks and Applications · Photonic and Optical Devices
Methods: Focus
