Interpretable Convolutional SyncNet
Sungjoon Park, Jaesub Yun, Donggeon Lee, Minsik Park

TL;DR
This paper introduces a convolutional sync-net trained with a balanced BCE (BBCE) loss. Its output can be read as a probability, it handles larger images better than transformer-based sync-nets, and it reaches state-of-the-art accuracy on audio-visual synchronization.
Contribution
The work presents a convolutional sync-net trained with a novel balanced BCE (BBCE) loss. Compared with InfoNCE-trained models, its output is easier to interpret; compared with transformer-based models, it copes better with large images.
Findings
Achieves 96.5% accuracy on the LRS2 dataset.
Achieves 93.8% accuracy on the LRS3 dataset.
Defines probabilistic metrics, such as probability at offset, for evaluating sync quality.
Abstract
Because videos in the wild can be out of sync for various reasons, a sync-net is used to bring the video back into sync for tasks that require synchronized videos. Previous state-of-the-art (SOTA) sync-nets use InfoNCE loss, rely on the transformer architecture, or both. Unfortunately, the former makes the model's output difficult to interpret, and the latter is unfriendly with large images, thus limiting the usefulness of sync-nets. In this work, we train a convolutional sync-net using the balanced BCE loss (BBCE), a loss inspired by the binary cross entropy (BCE) and the InfoNCE losses. In contrast to the InfoNCE loss, the BBCE loss does not require complicated sampling schemes. Our model can better handle larger images, and its output can be given a probabilistic interpretation. The probabilistic interpretation allows us to define metrics such as probability at offset and offscreen…
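The abstract does not give the BBCE formula, so the following is only a minimal sketch of one plausible reading: a binary cross entropy in which the in-sync (positive) and out-of-sync (negative) audio-video pairs each contribute half of the loss, regardless of how many negatives are drawn. The function names `balanced_bce` and `sync_probabilities`, the equal 0.5 weighting, and the frame offsets in the usage lines are assumptions for illustration, not the authors' definitions. The second function shows why a sigmoid output is easy to interpret: each candidate offset gets its own probability of being in sync.

```python
import math


def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def log_sigmoid(x):
    # Stable log(sigmoid(x)) = min(x, 0) - log(1 + exp(-|x|)).
    return min(x, 0.0) - math.log1p(math.exp(-abs(x)))


def balanced_bce(pos_logits, neg_logits):
    """Hypothetical class-balanced BCE (an assumption, not the paper's formula).

    Positives and negatives are averaged separately and then weighted
    equally, so the loss does not depend on the negative-to-positive ratio.
    """
    pos_term = -sum(log_sigmoid(z) for z in pos_logits) / len(pos_logits)
    neg_term = -sum(log_sigmoid(-z) for z in neg_logits) / len(neg_logits)
    return 0.5 * (pos_term + neg_term)


def sync_probabilities(logits_by_offset):
    """Map each candidate offset to an independent in-sync probability."""
    return {offset: sigmoid(z) for offset, z in logits_by_offset.items()}


# Usage with made-up logits: one in-sync pair, three out-of-sync pairs.
loss = balanced_bce([2.0], [-1.0, -3.0, 0.5])
probs = sync_probabilities({-1: -2.0, 0: 3.0, 1: -1.5})
best_offset = max(probs, key=probs.get)
```

Unlike a softmax over sampled candidates, the per-offset probabilities here need not sum to one, so a clip with no valid offset can score low everywhere.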
Taxonomy
Topics: Neural Networks and Applications
Methods: InfoNCE
