SpecMaskFoley: Steering Pretrained Spectral Masked Generative Transformer Toward Synchronized Video-to-audio Synthesis via ControlNet
Zhi Zhong, Akira Takahashi, Shuyang Cui, Keisuke Toyama, Shusuke Takahashi, Yuki Mitsufuji

TL;DR
This paper introduces SpecMaskFoley, a novel method that adapts a pretrained spectral masked generative transformer for synchronized video-to-audio foley synthesis, outperforming from-scratch models by using a frequency-aware temporal feature aligner.
Contribution
It proposes a way to steer a pretrained audio generative model toward video-synchronized foley synthesis using ControlNet and a frequency-aware temporal feature aligner, narrowing the performance gap between ControlNet-based and from-scratch foley models.
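The ControlNet-style steering mentioned above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the class `ControlledBlock`, the helper `resample_time`, the single-matrix "block", and all shapes are hypothetical, and plain linear interpolation stands in for whatever the paper's feature aligner actually does. The sketch only shows the general ControlNet idea: a frozen pretrained block, a trainable copy that also sees the condition, and a zero-initialized output projection so the controlled model initially behaves exactly like the pretrained one.

```python
import numpy as np

rng = np.random.default_rng(0)


def resample_time(feats, num_steps):
    """Linearly interpolate (T_video, D) features to (num_steps, D).

    Assumption: simple interpolation, used here only to bring video
    features to the audio token rate.
    """
    src = np.linspace(0.0, 1.0, feats.shape[0])
    dst = np.linspace(0.0, 1.0, num_steps)
    cols = [np.interp(dst, src, feats[:, d]) for d in range(feats.shape[1])]
    return np.stack(cols, axis=1)


class ControlledBlock:
    """Hypothetical one-layer stand-in for a pretrained block plus control branch."""

    def __init__(self, dim, cond_dim):
        # Pretrained weights: kept frozen during adaptation.
        self.w_frozen = rng.normal(size=(dim, dim)) / np.sqrt(dim)
        # Trainable copy of the pretrained weights (the control branch).
        self.w_copy = self.w_frozen.copy()
        # Trainable projection from condition features to model width.
        self.w_cond = rng.normal(size=(cond_dim, dim)) / np.sqrt(cond_dim)
        # Zero-initialized output projection: no effect before training.
        self.w_zero = np.zeros((dim, dim))

    def frozen(self, x):
        return np.tanh(x @ self.w_frozen)

    def __call__(self, x, cond):
        control = np.tanh((x + cond @ self.w_cond) @ self.w_copy)
        return self.frozen(x) + control @ self.w_zero


audio_tokens = rng.normal(size=(50, 16))  # 50 audio time steps, width 16
video_feats = rng.normal(size=(8, 32))    # 8 video frames, 32-dim features
cond = resample_time(video_feats, audio_tokens.shape[0])

block = ControlledBlock(dim=16, cond_dim=32)
out = block(audio_tokens, cond)
```

Because the output projection starts at zero, `out` equals the frozen block's output before any training; the condition begins to influence generation only as that projection is learned.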
Findings
SpecMaskFoley outperforms from-scratch baselines on benchmark tests.
The frequency-aware temporal feature aligner effectively bridges the gap between video features and audio generation.
The method demonstrates superior synchronization and audio quality in foley synthesis.
Abstract
Foley synthesis aims to synthesize high-quality audio that is both semantically and temporally aligned with video frames. Given its broad application in creative industries, the task has gained increasing attention in the research community. To avoid the non-trivial task of training audio generative models from scratch, adapting pretrained audio generative models for video-synchronized foley synthesis presents an attractive direction. ControlNet, a method for adding fine-grained controls to pretrained generative models, has been applied to foley synthesis, but its use has been limited to handcrafted human-readable temporal conditions. In contrast, from-scratch models achieved success by leveraging high-dimensional deep features extracted using pretrained video encoders. We have observed a performance gap between ControlNet-based and from-scratch foley models. To narrow this gap, we…
Taxonomy
Topics: Music Technology and Sound Studies · Music and Audio Processing · Generative Adversarial Networks and Image Synthesis
Methods: Softmax · Attention Is All You Need
