Efficient Video to Audio Mapper with Visual Scene Detection

Mingjing Yi; Ming Li

arXiv:2409.09823 · cs.SD · September 17, 2024

Open Access

TL;DR

This paper introduces an improved video-to-audio generation model that incorporates scene detection, enabling better handling of multiple scenes within videos and achieving higher fidelity and relevance in generated audio.

Contribution

The paper presents a video-to-audio (V2A) model with an integrated scene detector. The detector addresses the failure of earlier models to recognize and switch between multiple scenes in a video, which improves the quality of the generated audio.

Findings

01 Outperforms the baseline in fidelity and relevance

02 Successfully recognizes and switches between multiple scenes

03 Achieves superior results on the VGGSound dataset

Abstract

Video-to-audio (V2A) generation aims to produce corresponding audio given silent video inputs. This task is particularly challenging due to the cross-modality and sequential nature of the audio-visual features involved. Recent works have made significant progress in bridging the domain gap between video and audio, generating audio that is semantically aligned with the video content. However, a critical limitation of these approaches is their inability to effectively recognize and handle multiple scenes within a video, often leading to suboptimal audio generation in such cases. In this paper, we first reimplement a state-of-the-art V2A model with a slightly modified light-weight architecture, achieving results that outperform the baseline. We then propose an improved V2A model that incorporates a scene detector to address the challenge of switching between multiple visual scenes. Results…
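
The abstract does not say how the scene detector works, so the following is only a generic illustration of the idea of finding scene switches, not the authors' method. It assumes each frame has already been reduced to a feature vector and marks a boundary wherever the cosine similarity between consecutive frames falls below a threshold. The function names, the toy feature vectors, and the threshold of 0.5 are all hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def detect_scene_boundaries(
    features: Sequence[Sequence[float]], threshold: float = 0.5
) -> List[int]:
    """Return indices i where frame i starts a new scene.

    Assumption: a scene change shows up as a sharp drop in similarity
    between frame i-1 and frame i. The threshold is illustrative only.
    """
    return [
        i
        for i in range(1, len(features))
        if cosine_similarity(features[i - 1], features[i]) < threshold
    ]


def split_into_scenes(num_frames: int, boundaries: Sequence[int]) -> List[Tuple[int, int]]:
    """Turn boundary indices into half-open (start, end) frame ranges."""
    starts = [0] + list(boundaries)
    ends = list(boundaries) + [num_frames]
    return [(s, e) for s, e in zip(starts, ends) if e > s]


# Toy example: two frames pointing one way, then two pointing another.
frames = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
boundaries = detect_scene_boundaries(frames)
scenes = split_into_scenes(len(frames), boundaries)
print(boundaries)  # [2]
print(scenes)      # [(0, 2), (2, 4)]
```

In a pipeline of this kind, each (start, end) range could be passed to the audio generator separately so that the conditioning changes when the scene does; whether the paper does this is not stated in the abstract.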

Taxonomy

Topics: Video Analysis and Summarization · Advanced Vision and Imaging · Advanced Image and Video Retrieval Techniques