ViSAGe: Video-to-Spatial Audio Generation
Jaeyeon Kim, Heeseung Yun, Gunhee Kim

TL;DR
This paper introduces ViSAGe, an end-to-end framework that generates spatial first-order ambisonics audio directly from silent videos, supported by a new large-scale dataset and novel evaluation metrics, advancing immersive audio-visual experiences.
Contribution
The work presents ViSAGe, a novel end-to-end model for video-to-spatial audio generation, along with the YT-Ambigen dataset and new evaluation metrics for spatial audio quality assessment.
Findings
ViSAGe outperforms two-stage approaches in producing plausible spatial audio.
Generated spatial audio is temporally aligned and adapts to viewpoint changes.
The approach demonstrates high-quality, coherent spatial audio generation from silent videos.
Abstract
Spatial audio is essential for enhancing the immersiveness of audio-visual experiences, yet its production typically demands complex recording systems and specialized expertise. In this work, we address a novel problem of generating first-order ambisonics, a widely used spatial audio format, directly from silent videos. To support this task, we introduce YT-Ambigen, a dataset comprising 102K 5-second YouTube video clips paired with corresponding first-order ambisonics. We also propose new evaluation metrics to assess the spatial aspect of generated audio based on audio energy maps and saliency metrics. Furthermore, we present Video-to-Spatial Audio Generation (ViSAGe), an end-to-end framework that generates first-order ambisonics from silent video frames by leveraging CLIP visual features and autoregressive neural audio codec modeling with both directional and visual guidance. Experimental…
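For readers unfamiliar with the format, the following is a minimal sketch of what a first-order ambisonics (FOA) signal contains. It is not the paper's implementation. It assumes the AmbiX convention (channel order W, Y, Z, X with SN3D normalization) and a single point source. The function names `encode_foa`, `rotate_yaw`, and `estimate_direction` are hypothetical and chosen for illustration only. The sketch encodes a mono tone at a given direction, rotates the sound field about the vertical axis (the kind of operation a viewpoint change or rotation augmentation implies), and recovers the direction from the time-averaged product of W with the directional channels.

```python
import math


def encode_foa(mono, azimuth_deg, elevation_deg):
    """Encode a mono signal as FOA, assuming AmbiX (ACN order W, Y, Z, X; SN3D)."""
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    gains = (
        1.0,                          # W: omnidirectional
        math.sin(az) * math.cos(el),  # Y: left-right axis
        math.sin(el),                 # Z: up-down axis
        math.cos(az) * math.cos(el),  # X: front-back axis
    )
    return [[g * s for s in mono] for g in gains]


def rotate_yaw(foa, angle_deg):
    """Rotate the sound field about the vertical axis; W and Z are unchanged."""
    a = math.radians(angle_deg)
    w, y, z, x = foa
    x_rot = [xi * math.cos(a) - yi * math.sin(a) for xi, yi in zip(x, y)]
    y_rot = [xi * math.sin(a) + yi * math.cos(a) for xi, yi in zip(x, y)]
    return [w, y_rot, z, x_rot]


def estimate_direction(foa):
    """Estimate (azimuth, elevation) in degrees from the averaged W*X, W*Y, W*Z."""
    w, y, z, x = foa
    ix = sum(wi * xi for wi, xi in zip(w, x))
    iy = sum(wi * yi for wi, yi in zip(w, y))
    iz = sum(wi * zi for wi, zi in zip(w, z))
    azimuth = math.degrees(math.atan2(iy, ix))
    elevation = math.degrees(math.atan2(iz, math.hypot(ix, iy)))
    return azimuth, elevation


# A 440 Hz tone, 0.1 s at 16 kHz, placed 30 degrees to the left and 10 degrees up.
mono = [math.sin(2 * math.pi * 440 * n / 16000) for n in range(1600)]
foa = encode_foa(mono, 30.0, 10.0)
az, el = estimate_direction(foa)

# Rotating the field by 45 degrees of yaw shifts the azimuth and keeps the elevation.
az_rot, el_rot = estimate_direction(rotate_yaw(foa, 45.0))
```

The direction estimate here works only because a single source is present; real recordings with several sources or reverberation would need a spatially resolved energy map rather than one global direction.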
Peer Reviews
Decision: ICLR 2025 Poster
1. The paper introduces a groundbreaking approach to directly generate spatial audio from video, addressing a previously unsolved problem and offering a significant advancement in the field of immersive media.
2. The ViSAGe framework is an end-to-end solution that integrates neural audio codecs with visual features, which is a novel combination in the context of audio generation from video.
3. The paper is well-structured, with a clear problem statement, including the introduction of a new datas
As shown in Figure 2, the framework designed in this paper uses many different modules. Therefore, the computational complexity (model parameters) and running time (inference time) of the overall framework need to be discussed.
- The paper is clearly written and well-presented.
- The proposed task of generating spatial audio from silent video is interesting and goes one step beyond existing works that tackle these tasks separately.
- The dataset curation pipeline effectively addresses the limitations of existing datasets, and the dataset would make a contribution to the community.
- Lack of moving object samples or objects that are not centered.
- When listening to the synthesized audio while watching the video, most content appears centered, which makes the synthesis task relatively simple and appears to be similar to the mono audio generation task.
- To fully demonstrate the effectiveness of the proposed method and task, more qualitative examples are needed that show results in challenging scenarios (e.g., objects moving from left to right, objects not centered,
- The problem is new, interesting, and important for the research community.
- The authors introduce a new dataset for this new task, setting a benchmark for future spatial audio generation research.
- Suitable baselines and metrics are proposed for comparison on this new task.
- Carefully designed components are incorporated, e.g., FOA encoding, the order in which its channels are generated, rotation augmentation, and patchwise energy maps.
Major:
Missing subjective tests:
- The paper lacks subjective evaluations; studies assessing quality and directionality (or localization accuracy) should be included. The authors should compare their approach with baselines on metrics like mean opinion score (or other subjective metrics).
Demo examples:
- While the demo examples appear semantically good, the sounds are often too diffuse, making it challenging to precisely localize the direction of the audio.
- Including some static sources with
Taxonomy
Topics: Music Technology and Sound Studies · Music and Audio Processing
Methods: Contrastive Language-Image Pre-training
