Multi-Resolution Audio-Visual Feature Fusion for Temporal Action Localization
Edward Fish, Jon Weinbren, Andrew Gilbert

TL;DR
This paper proposes MRAV-FF, a novel hierarchical gated cross-attention method for multi-resolution audio-visual feature fusion, significantly improving temporal action localization accuracy by integrating audio cues across different temporal scales.
Contribution
It introduces a hierarchical gated cross-attention mechanism for effective multi-resolution audio-visual feature fusion in TAL, enhancing existing FPN architectures with audio integration.
Findings
Improved boundary regression accuracy
Enhanced classification confidence with audio data
Versatile integration with existing TAL architectures
Abstract
Temporal Action Localization (TAL) aims to identify actions' start, end, and class labels in untrimmed videos. While recent advancements using transformer networks and Feature Pyramid Networks (FPN) have enhanced visual feature recognition in TAL tasks, less progress has been made in the integration of audio features into such frameworks. This paper introduces the Multi-Resolution Audio-Visual Feature Fusion (MRAV-FF), an innovative method to merge audio-visual data across different temporal resolutions. Central to our approach is a hierarchical gated cross-attention mechanism, which discerningly weighs the importance of audio information at diverse temporal scales. Such a technique not only refines the precision of regression boundaries but also bolsters classification confidence. Importantly, MRAV-FF is versatile, making it compatible with existing FPN TAL architectures and offering a…
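The abstract describes gated cross-attention applied at several temporal resolutions. The following is a minimal sketch of that general idea, not the authors' implementation. It assumes single-head attention, a sigmoid gate computed from the concatenated visual and attended-audio features, average pooling to build the pyramid, and randomly initialised weights. All function and parameter names (`gated_cross_attention`, `downsample`, `fuse_pyramid`, `Wq`, `Wk`, `Wv`, `Wg`) are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def gated_cross_attention(visual, audio, params):
    """Fuse audio into visual features at one temporal resolution.

    visual, audio: arrays of shape (T, D). Visual features act as queries,
    audio features as keys and values (an assumption of this sketch).
    """
    q = visual @ params["Wq"]
    k = audio @ params["Wk"]
    v = audio @ params["Wv"]
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))   # (T, T)
    attended = attn @ v                              # (T, D)
    # Sigmoid gate in (0, 1) decides how much audio to admit per feature.
    gate_in = np.concatenate([visual, attended], axis=-1)   # (T, 2D)
    gate = 1.0 / (1.0 + np.exp(-(gate_in @ params["Wg"])))  # (T, D)
    return visual + gate * attended


def downsample(x, factor=2):
    # Average-pool along time; drops a trailing remainder if T is not divisible.
    t = (x.shape[0] // factor) * factor
    return x[:t].reshape(-1, factor, x.shape[1]).mean(axis=1)


def fuse_pyramid(visual, audio, levels, rng):
    """Apply gated cross-attention at each pyramid level, coarsening by 2x."""
    d = visual.shape[1]
    fused = []
    for _ in range(levels):
        # Hypothetical per-level weights; a trained model would learn these.
        params = {
            "Wq": rng.normal(scale=d ** -0.5, size=(d, d)),
            "Wk": rng.normal(scale=d ** -0.5, size=(d, d)),
            "Wv": rng.normal(scale=d ** -0.5, size=(d, d)),
            "Wg": rng.normal(scale=d ** -0.5, size=(2 * d, d)),
        }
        fused.append(gated_cross_attention(visual, audio, params))
        visual, audio = downsample(visual), downsample(audio)
    return fused
```

Calling `fuse_pyramid` with two `(16, 8)` arrays and `levels=3` returns three fused feature maps, one per temporal resolution, each of which could in principle be passed to the classification and boundary-regression heads of an FPN-style TAL model.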
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Music Technology and Sound Studies
Methods: 1x1 Convolution · Convolution · Feature Pyramid Network
