EchoFoley: Event-Centric Hierarchical Control for Video Grounded Creative Sound Generation
Bingxuan Li, Yiming Cui, Yicheng He, Yiwei Wang, Shu Zhang, Longyin Wen, Yulei Niu

TL;DR
EchoFoley introduces a hierarchical, event-centric approach to fine-grained, controllable video-grounded sound generation. It addresses visual dominance and weak instruction understanding with a new benchmark and a new generation framework.
Contribution
The paper presents EchoFoley, a novel task and benchmark for detailed sound control in videos, along with EchoVidia, a new generation framework that improves controllability and quality.
Findings
EchoVidia outperforms recent VT2A models by 40.7% in controllability.
EchoVidia also achieves a 12.5% improvement in perceptual quality.
The benchmark contains over 6,000 annotated video-instruction triplets.
Abstract
Sound effects build an essential layer of multimodal storytelling, shaping the emotional atmosphere and the narrative semantics of videos. Despite recent advances in video-text-to-audio (VT2A) generation, the current formulation faces three key limitations: first, an imbalance between visual and textual conditioning that leads to visual dominance; second, the absence of a concrete definition for fine-grained controllable generation; third, weak instruction understanding and following, as existing datasets rely on brief categorical tags. To address these limitations, we introduce EchoFoley, a new task designed for video-grounded sound generation with both event-level local control and hierarchical semantic control. Our symbolic representation for sounding events specifies when, what, and how each sound is produced within a video or instruction, enabling fine-grained controls like sound…
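The "when, what, and how" representation described in the abstract can be pictured as a small record per sounding event. The sketch below is an illustration only, not the paper's actual schema: the class name SoundingEvent, its field names, and the helpers events_active_at and edit_event are all hypothetical, and the example timeline is invented for demonstration.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class SoundingEvent:
    """One sounding event. All field names are hypothetical, not the paper's schema."""
    start_s: float  # when: onset in seconds
    end_s: float    # when: offset in seconds
    what: str       # what: the sound source or category
    how: str        # how: free-text description of the sound's character

    def __post_init__(self) -> None:
        # Reject empty or negative time spans.
        if self.start_s < 0 or self.end_s <= self.start_s:
            raise ValueError("expected 0 <= start_s < end_s")


def events_active_at(events: List[SoundingEvent], t: float) -> List[SoundingEvent]:
    """Return the events whose half-open interval [start_s, end_s) contains t."""
    return [e for e in events if e.start_s <= t < e.end_s]


def edit_event(events: List[SoundingEvent], index: int, **changes) -> List[SoundingEvent]:
    """Event-level local edit: copy the timeline and change only one event."""
    edited = list(events)
    edited[index] = replace(edited[index], **changes)
    return edited


# Invented example timeline with two overlapping events.
timeline = [
    SoundingEvent(0.0, 2.5, "footsteps", "slow, on gravel"),
    SoundingEvent(2.0, 3.0, "door", "creaking open"),
]

# Change how the door sounds while leaving the footsteps untouched.
edited = edit_event(timeline, 1, how="slamming shut")
```

Keeping each event immutable and returning a new timeline from edit_event is one way to make a local edit explicit: only the targeted event differs between the two lists, which mirrors the idea of event-level control without claiming anything about how EchoVidia implements it.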
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Music Technology and Sound Studies
