AlignNet: A Unifying Approach to Audio-Visual Alignment

Jianren Wang; Zhaoyuan Fang; Hang Zhao

arXiv:2002.05070 · cs.CV · February 13, 2020 · 1 citation

TL;DR

AlignNet synchronizes a video with a reference audio track under non-uniform, irregular misalignment by learning a dense frame-to-audio correspondence. The authors report that it outperforms prior methods on dance-music and speech-lip alignment.

Contribution

The paper introduces AlignNet, an end-to-end audio-visual alignment model built on four established principles (attention, pyramidal processing, warping, and an affinity function), and releases Dance50, a dancing dataset for training and evaluation.

Findings

01. Reported to outperform state-of-the-art methods on dance-music and speech-lip alignment in qualitative, quantitative, and subjective evaluations.

02. Handles non-uniform and irregular misalignments between video and audio.

03. Provides a new dataset, Dance50, for training and evaluation.

Abstract

We present AlignNet, a model that synchronizes videos with reference audios under non-uniform and irregular misalignments. AlignNet learns the end-to-end dense correspondence between each frame of a video and an audio. Our method is designed according to simple and well-established principles: attention, pyramidal processing, warping, and affinity function. Together with the model, we release a dancing dataset Dance50 for training and evaluation. Qualitative, quantitative and subjective evaluation results on dance-music alignment and speech-lip alignment demonstrate that our method far outperforms the state-of-the-art methods. Project video and code are available at https://jianrenw.github.io/AlignNet.
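As a rough illustration of what a dense frame-to-audio correspondence makes possible, the sketch below resamples a video once each frame has been assigned a time on the audio clock. It is not the AlignNet network and does not reproduce the paper's attention, pyramid, or affinity components. The function name `warp_frames`, its parameters, and the toy data are all hypothetical and exist only for this example.

```python
from bisect import bisect_left


def warp_frames(frames, frame_times, target_times):
    """Resample frames onto the audio clock by nearest-neighbour lookup.

    frames       : list of per-frame payloads (any objects)
    frame_times  : predicted audio-clock time of each frame, sorted ascending
                   (assumed to come from some alignment model)
    target_times : audio-clock times at which output frames are wanted
    """
    if len(frames) != len(frame_times):
        raise ValueError("need exactly one predicted time per frame")
    if not frames:
        raise ValueError("need at least one frame")

    warped = []
    for t in target_times:
        j = bisect_left(frame_times, t)  # first frame at or after time t
        if j == 0:
            idx = 0
        elif j == len(frame_times):
            idx = len(frame_times) - 1
        else:
            # Choose whichever neighbour is closer in time; ties go to the earlier frame.
            after_gap = frame_times[j] - t
            before_gap = t - frame_times[j - 1]
            idx = j if after_gap < before_gap else j - 1
        warped.append(frames[idx])
    return warped


# Toy example: the first three frames are bunched early on the audio clock
# and the last two are stretched out, i.e. a non-uniform misalignment.
frames = ["f0", "f1", "f2", "f3", "f4"]
frame_times = [0.0, 0.5, 1.0, 3.0, 4.0]
target_times = [0.0, 1.0, 2.0, 3.0, 4.0]

print(warp_frames(frames, frame_times, target_times))
# ['f0', 'f2', 'f2', 'f3', 'f4']
```

In the toy output, `f1` is dropped where the video runs fast relative to the audio and `f2` is repeated where it runs slow, which is the kind of per-frame correction a uniform time shift cannot express.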

Code & Models

Repositories

zfang399/AlignNet (PyTorch)

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Video Analysis and Summarization