Video-to-Audio Generation with Hidden Alignment
Manjie Xu, Chenxing Li, Xinyi Tu, Yong Ren, Rilin Chen, Yu Gu, Wei Liang, Dong Yu

TL;DR
This paper explores methods for generating synchronized audio from video inputs, focusing on vision encoders, auxiliary embeddings, and data augmentation, achieving state-of-the-art results in video-to-audio alignment.
Contribution
It introduces a foundational model and comprehensive evaluation pipeline, providing new insights into the effects of different encoders, embeddings, and augmentation techniques for improved video-to-audio generation.
Findings
State-of-the-art video-audio synchronization performance
Impact of data augmentation on generation quality
Effectiveness of various vision encoders and embeddings
Abstract
Generating semantically and temporally aligned audio content in accordance with video input has become a focal point for researchers, particularly following the remarkable breakthrough in text-to-video generation. In this work, we aim to offer insights into the video-to-audio generation paradigm, focusing on three crucial aspects: vision encoders, auxiliary embeddings, and data augmentation techniques. Beginning with a foundational model built on a simple yet surprisingly effective intuition, we explore various vision encoders and auxiliary embeddings through ablation studies. Employing a comprehensive evaluation pipeline that emphasizes generation quality and video-audio synchronization alignment, we demonstrate that our model exhibits state-of-the-art video-to-audio generation capabilities. Furthermore, we provide critical insights into the impact of different data augmentation…
Peer Reviews
Decision·ICLR 2025 Conference Withdrawn Submission
1. This paper proposes an LDM-based baseline that surpasses the other compared methods on some metrics.
2. It performs various ablations on video encoder selection, data augmentation, and text embedding, which add to the paper's empirical value.
1. It shows that the text embedding helps the model generate the audio; I think that makes sense, but I’d like to know how detailed the caption needs to be. Does a dense video caption help, or is a succinct video caption enough?
2. Minor: Why are Experiment Setup and Experiments in separate sections? I think it’s common to put the experimental setup as a subsection under the experiment section.
3. In L383, the data augmentation is to randomly combine video and audio segments. But does this re
1. Lots of studies on the design choices. While many components of the model inherit from previous pretrained frozen models (which I don't think is a weakness), the authors have experimented with many versions of vision encoders (6 in the paper's Table 2).
2. Good paper writing. The paper is well written. Despite a few missing details, the overall paper is easy to follow.
3. Lots of evaluation metrics. The paper uses many evaluation metrics, which help in assessing the overall model in semantic
1. Lack of evaluation datasets and synchronization metrics. VGGSound is in fact a noisy dataset for measuring both: on one hand, it contains overwhelming audio noise, and many video-audio pairs are not semantically aligned; on the other hand, the video-audio synchronization is even poorer due to camera shake, out-of-scene sound sources, etc. Previous papers on VTA have evaluated on other datasets, which can be simpler but much cleaner, e.g., evaluating semantic alignment on the Landscapes dataset (Se
- The motivation behind this study is significant for a deeper understanding of how we should construct video-to-audio models. Many recent video-to-audio models are constructed by extending audio generation models so that they accept video (and any other auxiliary) conditions. Thus, the choice of vision encoders and auxiliary embeddings is a specific and important challenge, which this study explores empirically. Rigorous investigation would be expected to significantly bene
- Several descriptions in the paper are unclear, making it difficult to follow the authors' arguments.
- Each vision encoder should have a different resolution (and provide a different size of features), but these details are not given in the paper. I would suggest that the authors provide a table or detailed description of the temporal and spatial resolutions used for each vision encoder, as well as how these were standardized (if at all) for fair comparison.
- Particularly, it would be bene
Taxonomy
Topics: Music Technology and Sound Studies
