Space-Time Memory Network for Sounding Object Localization in Videos

Sizhe Li; Yapeng Tian; Chenliang Xu

arXiv:2111.05526 · cs.CV · November 11, 2021 · 1 citation

TL;DR

This paper introduces a space-time memory network that enhances the localization of sounding objects in videos by learning spatio-temporal attention across audio and visual data, improving robustness and accuracy.

Contribution

It presents a novel spatio-temporal attention mechanism within a memory network for joint audio-visual object localization, outperforming existing methods.

Findings

1. Effective in complex audio-visual scenes
2. Outperforms recent state-of-the-art methods
3. Demonstrates robustness through quantitative and qualitative analysis

Abstract

Leveraging temporal synchronization and association within sight and sound is an essential step towards robust localization of sounding objects. To this end, we propose a space-time memory network for sounding object localization in videos. It can simultaneously learn spatio-temporal attention over both uni-modal and cross-modal representations from audio and visual modalities. We show and analyze both quantitatively and qualitatively the effectiveness of incorporating spatio-temporal learning in localizing audio-visual objects. We demonstrate that our approach generalizes over various complex audio-visual scenes and outperforms recent state-of-the-art methods.
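The abstract does not give the network's internals, so the following is only a minimal sketch of the general idea it names: attention over a space-time memory of visual features, followed by a cross-modal comparison with an audio embedding. It should not be read as the authors' architecture. The function names (`memory_read`, `localization_map`), the scaled dot-product scoring, and the cosine-similarity heat map are all assumptions chosen for illustration.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def memory_read(query, memory_keys, memory_values):
    """Attend from one query vector over a space-time memory.

    Assumption: each memory entry is a feature vector taken from some
    (time, row, column) position of earlier frames, flattened to a list.
    Returns the attention-weighted sum of values and the weights.
    """
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, k) / scale for k in memory_keys])
    dim = len(memory_values[0])
    fused = [sum(w * v[i] for w, v in zip(weights, memory_values))
             for i in range(dim)]
    return fused, weights


def cosine(a, b, eps=1e-8):
    """Cosine similarity with a small epsilon to avoid division by zero."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)) + eps)


def localization_map(audio, frame_features):
    """Cross-modal step: compare an audio embedding with every spatial
    location of an H x W grid of visual feature vectors.

    Higher values mark locations whose features align with the audio.
    """
    return [[cosine(audio, feat) for feat in row] for row in frame_features]
```

In this sketch, each spatial location of the current frame would first call `memory_read` against features stored from earlier frames, and the fused grid would then be passed to `localization_map` together with the audio embedding. How the actual model combines uni-modal and cross-modal attention is described in the paper itself, not here.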

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Hearing Loss and Rehabilitation

Methods: Memory Network