FoleyDirector: Fine-Grained Temporal Steering for Video-to-Audio Generation via Structured Scripts

You Li; Dewei Zhou; Fan Ma; Fu Li; Dongliang He; Yi Yang

arXiv:2603.19857·cs.SD·April 21, 2026

TL;DR

FoleyDirector introduces a novel framework for fine-grained temporal control in video-to-audio generation, utilizing structured scripts and a new dataset to improve controllability without sacrificing audio quality.

Contribution

The paper presents FoleyDirector, described by the authors as the first framework to enable precise temporal guidance in DiT-based V2A generation, together with new modules, datasets, and evaluation benchmarks for enhanced controllability.

Findings

01 Significantly improves temporal controllability in V2A.

02 Maintains high audio fidelity comparable to baseline models.

03 Enables seamless switching between generation modes.

Abstract

Recent Video-to-Audio (V2A) methods have achieved remarkable progress, enabling the synthesis of realistic, high-quality audio. However, they struggle with fine-grained temporal control in multi-event scenarios or when visual cues are insufficient, such as small regions, off-screen sounds, or occluded or partially visible objects. In this paper, we propose FoleyDirector, a framework that, for the first time, enables precise temporal guidance in DiT-based V2A generation while preserving the base model's audio quality and allowing seamless switching between V2A generation and temporally controlled synthesis. FoleyDirector introduces Structured Temporal Scripts (STS), a set of captions corresponding to short temporal segments, to provide richer temporal information. These features are integrated via the Script-Guided Temporal Fusion Module, which employs Temporal Script Attention to fuse…
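
To make the idea of Structured Temporal Scripts more concrete, here is a minimal Python sketch of one way a set of captions tied to short temporal segments could be represented. The class and function names, the fixed segment length, and the example captions are all illustrative assumptions; the abstract does not specify the script format FoleyDirector actually uses.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ScriptSegment:
    """One entry of a hypothetical structured temporal script."""
    start: float  # segment start time in seconds (inclusive)
    end: float    # segment end time in seconds (exclusive)
    caption: str  # text describing the sound expected in this segment


def build_script(captions: List[str], segment_seconds: float) -> List[ScriptSegment]:
    """Assign consecutive, equal-length time segments to a list of captions.

    Equal-length segments are an assumption made for this sketch only.
    """
    if segment_seconds <= 0:
        raise ValueError("segment_seconds must be positive")
    return [
        ScriptSegment(i * segment_seconds, (i + 1) * segment_seconds, text)
        for i, text in enumerate(captions)
    ]


def caption_at(script: List[ScriptSegment], t: float) -> Optional[str]:
    """Return the caption whose segment contains time t, or None if none does."""
    for seg in script:
        if seg.start <= t < seg.end:
            return seg.caption
    return None


# Hypothetical six-second clip split into three two-second segments.
script = build_script(
    ["door creaks open", "footsteps off-screen", "silence"],
    segment_seconds=2.0,
)
```

In this sketch, `caption_at(script, 2.5)` looks up the caption for the segment covering 2.5 seconds. A per-segment lookup like this is one way an off-screen or occluded sound event could be pinned to a time window even when the video gives no usable visual cue.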
