Spotlighting Partially Visible Cinematic Language for Video-to-Audio Generation via Self-distillation
Feizhen Huang, Yu Wu, Yutian Lin, Bo Du

TL;DR
This paper introduces a self-distillation method for video-to-audio (V2A) generation that improves performance when cinematic language leaves Foley targets only partially visible, strengthening the model's ability to associate sounds with incomplete visual cues.
Contribution
The paper presents a simple self-distillation approach that lets V2A models handle partial visual information by simulating cinematic language variations.
Findings
Improvements across all evaluation metrics in partial visibility scenarios
Enhanced results on the VGGSound dataset
Alignment of video features across training pairs that share the same audio-visual correspondence
Abstract
Video-to-Audio (V2A) generation has achieved significant progress and plays a crucial role in film and video post-production. However, current methods overlook cinematic language, a critical component of artistic expression in filmmaking. As a result, their performance deteriorates in scenarios where Foley targets are only partially visible. To address this challenge, we propose a simple self-distillation approach to extend V2A models to cinematic language scenarios. By simulating cinematic language variations, the student model learns to align the video features of training pairs with the same audio-visual correspondences, enabling it to effectively capture the associations between sounds and partial visual information. Our method not only achieves impressive improvements under partial visibility across all evaluation metrics, but also enhances performance on the large-scale V2A…
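The feature-alignment idea described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the random crop standing in for a cinematic language variation, the mean-pooling `encode` function, the mean-squared-error alignment loss, the choice of a frozen teacher that sees the full frame, and every name below are assumptions made purely for illustration.

```python
import random


def simulate_partial_view(frame, crop_frac=0.5, rng=random):
    """Crop a random window from the frame (assumed stand-in for a
    cinematic variation in which the sound source is only partly visible)."""
    h, w = len(frame), len(frame[0])
    ch, cw = max(1, int(h * crop_frac)), max(1, int(w * crop_frac))
    top = rng.randint(0, h - ch)    # randint is inclusive on both ends
    left = rng.randint(0, w - cw)
    return [row[left:left + cw] for row in frame[top:top + ch]]


def encode(frame, weights):
    """Hypothetical toy encoder: global mean pooling, then a per-dimension scale."""
    pixels = [p for row in frame for p in row]
    mean = sum(pixels) / len(pixels)
    return [w * mean for w in weights]


def alignment_loss(student_feat, teacher_feat):
    """Mean squared error between student and teacher features."""
    n = len(student_feat)
    return sum((s - t) ** 2 for s, t in zip(student_feat, teacher_feat)) / n


def distill_step(frame, student_w, teacher_w, lr=0.5, rng=random):
    """One self-distillation step: the frozen teacher encodes the full frame,
    the student encodes a partial view, and the student is pulled toward the
    teacher's features by one gradient-descent update."""
    partial = simulate_partial_view(frame, rng=rng)
    teacher_feat = encode(frame, teacher_w)      # no update to the teacher
    student_feat = encode(partial, student_w)
    loss = alignment_loss(student_feat, teacher_feat)

    # Analytic gradient of the MSE w.r.t. each student weight:
    # d/dw_i mean_j (w_j * m - t_j)^2 = 2 * (w_i * m - t_i) * m / n
    pixels = [p for row in partial for p in row]
    m = sum(pixels) / len(pixels)
    n = len(student_w)
    new_w = [
        w - lr * 2.0 * (s - t) * m / n
        for w, s, t in zip(student_w, student_feat, teacher_feat)
    ]
    return new_w, loss


if __name__ == "__main__":
    rng = random.Random(0)
    frame = [[0.5 + 0.05 * (i + j) for j in range(4)] for i in range(4)]
    teacher_w, student_w = [1.0, -0.5], [0.0, 0.0]
    for step in range(50):
        student_w, loss = distill_step(frame, student_w, teacher_w, rng=rng)
    print("final alignment loss:", loss)
```

In this sketch the full view and the cropped view come from the same clip, so they share one audio-visual correspondence; the loss only asks the student's features for the partial view to match the teacher's features for the full view. A real V2A system would use a learned video encoder and feed the aligned features to an audio generator, neither of which is modeled here.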
