Action Dubber: Timing Audible Actions via Inflectional Flow
Wenlong Wan, Weiying Zheng, Tianyi Xiang, Guiqing Li, Shengfeng He

TL;DR
This paper presents a new task called Audible Action Temporal Localization, focusing on identifying when and where audible movements occur in videos, using a novel inflectional flow approach without relying on audio input.
Contribution
It introduces $TA^{2}Net$, a new architecture that estimates inflectional flow for precise timing and localization of audible actions, along with a new dataset $Audible623$ for benchmarking.
Findings
$TA^{2}Net$ outperforms existing methods on $Audible623$
The approach generalizes well to sound source localization and counting tasks
The dataset enables focused research on audible action localization
Abstract
We introduce the task of Audible Action Temporal Localization, which aims to identify the spatio-temporal coordinates of audible movements. Unlike conventional tasks such as action recognition and temporal action localization, which broadly analyze video content, our task focuses on the distinct kinematic dynamics of audible actions. It is based on the premise that key actions are driven by inflectional movements; for example, collisions that produce sound often involve abrupt changes in motion. To capture this, we propose $TA^{2}Net$, a novel architecture that estimates inflectional flow using the second derivative of motion to determine collision timings without relying on audio input. $TA^{2}Net$ also integrates a self-supervised spatial localization strategy during training, combining contrastive learning with spatial analysis. This dual design improves temporal localization…
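The premise that sound-producing collisions coincide with abrupt changes in motion can be illustrated with a toy example. The sketch below is not the $TA^{2}Net$ architecture; it only assumes a hypothetical per-frame 2D track of a single point and flags frames where the discrete second derivative of position (a finite-difference stand-in for acceleration) exceeds a hypothetical threshold. The function name `inflection_frames` and both of its parameters are illustrative assumptions.

```python
import math

def inflection_frames(track, threshold):
    """Flag frames whose second finite difference of position is large.

    track: list of (x, y) positions, one per frame (hypothetical input).
    threshold: cutoff on the acceleration magnitude (hypothetical value).
    """
    frames = []
    # The central second difference needs one neighbour on each side,
    # so the first and last frames are skipped.
    for t in range(1, len(track) - 1):
        accel = [
            nxt - 2.0 * cur + prev
            for prev, cur, nxt in zip(track[t - 1], track[t], track[t + 1])
        ]
        if math.sqrt(sum(a * a for a in accel)) > threshold:
            frames.append(t)
    return frames

# Toy track: a point drifts right while falling at constant speed,
# reaches the ground at frame 5, then rebounds at the same speed.
heights = [10.0, 8.0, 6.0, 4.0, 2.0, 0.0, 2.0, 4.0, 6.0, 8.0]
track = [(float(i), h) for i, h in enumerate(heights)]

print(inflection_frames(track, threshold=1.0))  # prints [5]
```

Constant-velocity motion yields a zero second difference, so only the rebound frame is flagged. Real video would need motion estimated from pixels and would be far noisier, which is the gap a learned model is meant to fill.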
Taxonomy
Topics: Music Technology and Sound Studies · Data Visualization and Analytics · Human Motion and Animation
