SpecMaskFoley: Steering Pretrained Spectral Masked Generative Transformer Toward Synchronized Video-to-audio Synthesis via ControlNet

Zhi Zhong; Akira Takahashi; Shuyang Cui; Keisuke Toyama; Shusuke Takahashi; Yuki Mitsufuji

arXiv:2505.16195 · cs.SD · July 21, 2025

TL;DR

SpecMaskFoley adapts a pretrained spectral masked generative transformer to synchronized video-to-audio foley synthesis via ControlNet. A frequency-aware temporal feature aligner lets it outperform from-scratch models.

Contribution

The paper proposes steering a pretrained audio generative model toward video-synchronized foley synthesis with ControlNet and a frequency-aware feature aligner, narrowing the performance gap between ControlNet-based and from-scratch models.

Findings

01. SpecMaskFoley outperforms from-scratch baselines on benchmark tests.

02. The frequency-aware temporal feature aligner effectively bridges the gap between video features and audio generation.

03. The method demonstrates superior synchronization and audio quality in foley synthesis.

Abstract

Foley synthesis aims to synthesize high-quality audio that is both semantically and temporally aligned with video frames. Given its broad application in creative industries, the task has gained increasing attention in the research community. To avoid the non-trivial task of training audio generative models from scratch, adapting pretrained audio generative models for video-synchronized foley synthesis presents an attractive direction. ControlNet, a method for adding fine-grained controls to pretrained generative models, has been applied to foley synthesis, but its use has been limited to handcrafted human-readable temporal conditions. In contrast, from-scratch models achieved success by leveraging high-dimensional deep features extracted using pretrained video encoders. We have observed a performance gap between ControlNet-based and from-scratch foley models. To narrow this gap, we…
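The ControlNet idea mentioned in the abstract can be illustrated with a toy sketch. This is not the paper's implementation: the functions `block` and `controlnet_step`, the tiny dimensions, and the assumption that the condition has already been projected to the token dimension are all hypothetical. It only shows the general ControlNet pattern of a frozen pretrained path plus a trainable copy whose output passes through a zero-initialized projection, so the adapted model starts out identical to the pretrained one.

```python
import math
import random

def linear(x, w):
    """Multiply a vector x (length n) by an n-by-m weight matrix w."""
    return [sum(xi * w[i][j] for i, xi in enumerate(x)) for j in range(len(w[0]))]

def block(x, w):
    """Toy stand-in for one pretrained layer (hypothetical): linear map, then tanh."""
    return [math.tanh(v) for v in linear(x, w)]

def controlnet_step(x, cond, w_frozen, w_copy, w_zero):
    """Frozen path plus a control residual gated by a zero-initialized projection."""
    base = block(x, w_frozen)                                  # frozen pretrained path
    control = block([a + b for a, b in zip(x, cond)], w_copy)  # trainable copy sees the condition
    residual = linear(control, w_zero)                         # all zeros while w_zero is zero
    return [b + r for b, r in zip(base, residual)]

random.seed(0)
dim = 4
x = [random.gauss(0, 1) for _ in range(dim)]       # toy audio token
cond = [random.gauss(0, 1) for _ in range(dim)]    # toy condition, assumed already in token space
w_frozen = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(dim)]
w_copy = [row[:] for row in w_frozen]              # trainable branch starts from pretrained weights
w_zero = [[0.0] * dim for _ in range(dim)]         # zero-initialized projection

out_at_init = controlnet_step(x, cond, w_frozen, w_copy, w_zero)
base_only = block(x, w_frozen)
```

At initialization `out_at_init` matches `base_only`, so training begins from the pretrained model's behaviour. The condition influences the output only once `w_zero` moves away from zero.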

Taxonomy

Topics: Music Technology and Sound Studies · Music and Audio Processing · Generative Adversarial Networks and Image Synthesis

Methods: Softmax · Attention Is All You Need