LD-LAudio-V1: Video-to-Long-Form-Audio Generation Extension with Dual Lightweight Adapters

Haomin Zhang; Kristin Qi; Shuxin Yang; Zihao Chen; Chaofan Ding; Xinhan Di

arXiv:2508.11074·cs.SD·August 18, 2025

TL;DR

This paper introduces LD-LAudio-V1, a model with dual lightweight adapters that generates high-quality, temporally synchronized long-form audio from videos. It addresses two limitations of previous methods: a focus on short-form clips and a reliance on noisy datasets.

Contribution

We propose LD-LAudio-V1, an extension of existing models with dual lightweight adapters for long-form video-to-audio generation, and release a clean, annotated dataset to facilitate future research.

Findings

01 Significant improvements in multiple audio quality metrics.

02 Reduction of splicing artifacts and temporal inconsistencies.

03 Enhanced computational efficiency in long-form audio generation.

Abstract

Generating high-quality and temporally synchronized audio from video content is essential for video editing and post-production tasks, enabling the creation of semantically aligned audio for silent videos. However, most existing approaches focus on short-form audio generation for video segments under 10 seconds or rely on noisy datasets for long-form video-to-audio synthesis. To address these limitations, we introduce LD-LAudio-V1, an extension of state-of-the-art video-to-audio models that incorporates dual lightweight adapters to enable long-form audio generation. In addition, we release a clean and human-annotated video-to-audio dataset that contains pure sound effects without noise or artifacts. Our method significantly reduces splicing artifacts and temporal inconsistencies while maintaining computational efficiency. Compared to direct fine-tuning with short training videos,…
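The abstract does not describe the internals of the dual adapters. As a generic illustration of what "lightweight adapter" commonly means, the sketch below shows a residual bottleneck adapter of the kind often attached to a frozen backbone. Everything in it is an assumption made for illustration and not the authors' architecture: the `BottleneckAdapter` class, the dimensions, the ReLU nonlinearity, and the zero initialization are all hypothetical.

```python
import random


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


class BottleneckAdapter:
    """Generic residual bottleneck adapter.

    Hypothetical illustration only; NOT the architecture from the paper.
    """

    def __init__(self, dim, bottleneck, seed=0):
        rng = random.Random(seed)
        # Down-projection: `bottleneck` rows, each of length `dim`.
        self.w_down = [[rng.gauss(0.0, 0.02) for _ in range(dim)]
                       for _ in range(bottleneck)]
        # Up-projection: `dim` rows, each of length `bottleneck`.
        # Zero-initialized, so the adapter starts as an identity mapping
        # and leaves the frozen backbone's features unchanged.
        self.w_up = [[0.0] * bottleneck for _ in range(dim)]

    def __call__(self, h):
        z = [max(v, 0.0) for v in matvec(self.w_down, h)]  # ReLU bottleneck
        delta = matvec(self.w_up, z)
        return [x + d for x, d in zip(h, delta)]  # residual connection

    def num_params(self):
        return (sum(len(row) for row in self.w_down)
                + sum(len(row) for row in self.w_up))


dim, bottleneck = 64, 8  # assumed sizes, chosen only for the example
adapter = BottleneckAdapter(dim, bottleneck)

rng = random.Random(1)
feature = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # stand-in backbone feature
out = adapter(feature)
```

With these assumed sizes the adapter holds 2 × 64 × 8 = 1,024 parameters, against 4,096 for a single dense 64 × 64 layer, which is the sense in which such modules are called lightweight. Only the adapter weights would be trained; the backbone stays frozen.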
