AV-Link: Temporally-Aligned Diffusion Features for Cross-Modal Audio-Video Generation

Moayed Haji-Ali; Willi Menapace; Aliaksandr Siarohin; Ivan Skorokhodov; Alper Canberk; Kwot Sin Lee; Vicente Ordonez; Sergey Tulyakov

arXiv:2412.15191·cs.CV·March 12, 2025

Open Access

TL;DR

AV-Link introduces a unified diffusion-based framework for cross-modal audio-video generation, significantly improving synchronization by leveraging frozen diffusion models and a novel Fusion Block for bidirectional feature exchange.

Contribution

It presents a novel unified framework that performs both video-to-audio and audio-to-video generation using frozen diffusion models and a Fusion Block for cross-modal alignment.

Findings

01

Outperforms baseline models in audio-video synchronization

02

Achieves both A2V and V2A tasks within a single framework

03

Demonstrates superior results in automatic and subjective evaluations

Abstract

We propose AV-Link, a unified framework for Video-to-Audio (V2A) and Audio-to-Video (A2V) generation that leverages the activations of frozen video and audio diffusion models for temporally-aligned cross-modal conditioning. The key to our framework is a Fusion Block that facilitates bidirectional information exchange between video and audio diffusion models through temporally-aligned self-attention operations. Unlike prior work that uses dedicated models for A2V and V2A tasks and relies on pretrained feature extractors, AV-Link achieves both tasks in a single self-contained framework, directly leveraging features obtained by the complementary modality (i.e., video features to generate audio, or audio features to generate video). Extensive automatic and subjective evaluations demonstrate that our method achieves a substantial improvement in audio-video synchronization, outperforming more…
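
The abstract describes the Fusion Block only at a high level, so the following is a rough sketch of the general idea rather than the authors' implementation. It assumes that "temporally-aligned self-attention" can be approximated by stamping every video and audio token with an encoding of its absolute time in the clip, then running one joint self-attention pass over the concatenated tokens. The function names (`time_encoding`, `fused_self_attention`), the sinusoidal encoding, and the use of identity query/key/value projections are all illustrative assumptions, not details taken from the paper.

```python
import math


def time_encoding(t, dim):
    """Sinusoidal encoding of a timestamp t (seconds). Assumes dim is even."""
    enc = []
    for i in range(dim // 2):
        freq = 1.0 / (10.0 ** (2 * i / dim))
        enc.append(math.sin(t * freq))
        enc.append(math.cos(t * freq))
    return enc


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def stamp_with_time(feats, duration):
    """Add a time encoding to each token, using the token's centre time.

    Tokens are assumed to be evenly spaced over `duration` seconds, so two
    modalities with different token rates still share one time axis.
    """
    n, dim = len(feats), len(feats[0])
    stamped = []
    for k, f in enumerate(feats):
        t = (k + 0.5) * duration / n
        pe = time_encoding(t, dim)
        stamped.append([x + p for x, p in zip(f, pe)])
    return stamped


def fused_self_attention(video_feats, audio_feats, duration):
    """Joint self-attention over video and audio tokens of one clip.

    Hypothetical simplification: queries, keys, and values are the stamped
    tokens themselves (no learned projections, single head).
    Returns (mixed_video_tokens, mixed_audio_tokens).
    """
    dim = len(video_feats[0])
    tokens = stamp_with_time(video_feats, duration) + stamp_with_time(
        audio_feats, duration
    )
    scale = 1.0 / math.sqrt(dim)
    mixed = []
    for q in tokens:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in tokens]
        weights = softmax(scores)
        mixed.append(
            [sum(w * v[j] for w, v in zip(weights, tokens)) for j in range(dim)]
        )
    n_video = len(video_feats)
    return mixed[:n_video], mixed[n_video:]


# Toy stand-ins for activations taken from frozen video and audio models:
# 2 video tokens and 4 audio tokens covering the same 1-second clip.
video = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]]
audio = [[0.0, 0.1, 0.0, 0.1], [0.2, 0.1, 0.2, 0.1],
         [0.3, 0.3, 0.3, 0.3], [0.4, 0.2, 0.4, 0.2]]
video_out, audio_out = fused_self_attention(video, audio, duration=1.0)
```

Because both modalities attend over the same concatenated sequence, each output video token can draw on audio tokens and vice versa, which is one plausible reading of the "bidirectional information exchange" the abstract mentions. A real system would add learned projections, multiple heads, and whatever positional scheme the paper actually uses.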

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Music Technology and Sound Studies

Methods: Softmax · Attention Is All You Need · Diffusion