Action Dubber: Timing Audible Actions via Inflectional Flow

Wenlong Wan; Weiying Zheng; Tianyi Xiang; Guiqing Li; Shengfeng He

arXiv:2506.13320·cs.CV·June 17, 2025

TL;DR

This paper introduces Audible Action Temporal Localization, the task of identifying when and where audible movements occur in a video, and addresses it with an inflectional flow approach that does not rely on audio input.

Contribution

It introduces $TA^{2}Net$, a new architecture that estimates inflectional flow for precise timing and localization of audible actions, along with a new dataset $Audible623$ for benchmarking.

Findings

01 $TA^{2}Net$ outperforms existing methods on $Audible623$.

02 The approach generalizes well to sound source localization and counting tasks.

03 The dataset enables focused research on audible action localization.

Abstract

We introduce the task of Audible Action Temporal Localization, which aims to identify the spatio-temporal coordinates of audible movements. Unlike conventional tasks such as action recognition and temporal action localization, which broadly analyze video content, our task focuses on the distinct kinematic dynamics of audible actions. It is based on the premise that key actions are driven by inflectional movements; for example, collisions that produce sound often involve abrupt changes in motion. To capture this, we propose $TA^{2}Net$, a novel architecture that estimates inflectional flow using the second derivative of motion to determine collision timings without relying on audio input. $TA^{2}Net$ also integrates a self-supervised spatial localization strategy during training, combining contrastive learning with spatial analysis. This dual design improves temporal localization…
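
The premise that sound-producing collisions show up as abrupt changes in motion can be illustrated with a toy calculation. The sketch below is not the authors' architecture: it assumes a single tracked point per frame, approximates the second derivative of its position with a central finite difference, and takes the frame with the largest magnitude as the candidate collision time. The function names `inflection_scores` and `peak_frame`, the `fps` parameter, and the bouncing trajectory are all hypothetical and chosen only for illustration.

```python
import numpy as np


def inflection_scores(positions, fps=30.0):
    """Magnitude of the discrete second derivative of a 2D trajectory.

    positions: array-like of shape (T, 2), one tracked point per frame
    (an assumption of this sketch, not something the paper specifies).
    Returns an array of shape (T,); the first and last frames are left at 0
    because the central difference is undefined there.
    """
    positions = np.asarray(positions, dtype=float)
    dt = 1.0 / fps
    accel = np.zeros_like(positions)
    # Central second difference: (x[t+1] - 2*x[t] + x[t-1]) / dt^2
    accel[1:-1] = (positions[2:] - 2.0 * positions[1:-1] + positions[:-2]) / dt**2
    return np.linalg.norm(accel, axis=1)


def peak_frame(positions, fps=30.0):
    """Index of the frame with the most abrupt change in motion."""
    return int(np.argmax(inflection_scores(positions, fps)))


# Toy trajectory: a point moves at constant speed along y, then reverses
# direction at frame 5, as if it had struck a surface.
y = np.array([0, 1, 2, 3, 4, 5, 4, 3, 2, 1, 0], dtype=float)
traj = np.stack([np.zeros_like(y), y], axis=1)

scores = inflection_scores(traj)
collision_frame = peak_frame(traj)
```

In this toy case the velocity is constant everywhere except at the reversal, so the score is zero on every frame but the one where the direction flips. Real video would need dense motion estimates and some smoothing before differentiating, since finite differences amplify noise.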

Videos

Action Dubber: Timing Audible Actions via Inflectional Flow · SlidesLive

Taxonomy

Topics: Music Technology and Sound Studies · Data Visualization and Analytics · Human Motion and Animation