EchoFoley: Event-Centric Hierarchical Control for Video Grounded Creative Sound Generation

Bingxuan Li; Yiming Cui; Yicheng He; Yiwei Wang; Shu Zhang; Longyin Wen; Yulei Niu

arXiv:2512.24731 · cs.CV · January 1, 2026

TL;DR

EchoFoley introduces a hierarchical, event-centric approach for fine-grained, controllable video-grounded sound generation, addressing visual dominance and instruction understanding issues with a new benchmark and generation framework.

Contribution

The paper presents EchoFoley, a novel task and benchmark for detailed sound control in videos, along with EchoVidia, a new generation framework that improves controllability and quality.

Findings

1. EchoVidia outperforms recent VT2A models by 40.7% in controllability.

2. It achieves a 12.5% improvement in perceptual quality.

3. The benchmark contains over 6,000 annotated video-instruction triplets.

Abstract

Sound effects build an essential layer of multimodal storytelling, shaping the emotional atmosphere and the narrative semantics of videos. Despite recent advances in video-text-to-audio (VT2A), the current formulation faces three key limitations: first, an imbalance between visual and textual conditioning that leads to visual dominance; second, the absence of a concrete definition for fine-grained controllable generation; third, weak instruction understanding and following, as existing datasets rely on brief categorical tags. To address these limitations, we introduce EchoFoley, a new task designed for video-grounded sound generation with both event-level local control and hierarchical semantic control. Our symbolic representation for sounding events specifies when, what, and how each sound is produced within a video or instruction, enabling fine-grained controls like sound…
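To make the "when, what, and how" idea concrete, the following is a minimal Python sketch of one possible event record. It is an illustration only: the abstract is truncated here and does not show the paper's actual symbolic representation, so the class name `SoundEvent`, its fields (`start_s`, `end_s`, `what`, `how`), and the helper `events_active_at` are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SoundEvent:
    """Hypothetical record for one sounding event (not the paper's schema)."""
    start_s: float                            # "when": onset, in seconds
    end_s: float                              # "when": offset, in seconds
    what: str                                 # "what": the sound source or action
    how: dict = field(default_factory=dict)   # "how": assumed free-form attributes

    def __post_init__(self):
        # Reject empty or negative time spans.
        if not 0 <= self.start_s < self.end_s:
            raise ValueError("expected 0 <= start_s < end_s")


def events_active_at(events, t):
    """Return the events whose half-open span [start_s, end_s) contains time t."""
    return [e for e in events if e.start_s <= t < e.end_s]


# Two overlapping events on an illustrative timeline.
timeline = [
    SoundEvent(0.0, 1.5, "door creak", {"volume": "soft"}),
    SoundEvent(1.0, 3.0, "footsteps", {"timbre": "wooden floor"}),
]
```

Under these assumptions, `events_active_at(timeline, 1.2)` returns both events, since their spans overlap at 1.2 seconds. An event-level edit would then amount to changing the `what` or `how` of a single record while leaving the rest of the timeline untouched.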

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Music Technology and Sound Studies