Space-Time Memory Network for Sounding Object Localization in Videos
Sizhe Li, Yapeng Tian, Chenliang Xu

TL;DR
This paper introduces a space-time memory network that enhances the localization of sounding objects in videos by learning spatio-temporal attention across audio and visual data, improving robustness and accuracy.
Contribution
It presents a novel spatio-temporal attention mechanism within a memory network for joint audio-visual object localization, outperforming existing methods.
Findings
Effective in complex audio-visual scenes
Outperforms recent state-of-the-art methods
Demonstrates robustness through quantitative and qualitative analysis
Abstract
Leveraging temporal synchronization and association within sight and sound is an essential step towards robust localization of sounding objects. To this end, we propose a space-time memory network for sounding object localization in videos. It can simultaneously learn spatio-temporal attention over both uni-modal and cross-modal representations from audio and visual modalities. We show and analyze both quantitatively and qualitatively the effectiveness of incorporating spatio-temporal learning in localizing audio-visual objects. We demonstrate that our approach generalizes over various complex audio-visual scenes and outperforms recent state-of-the-art methods.
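To make the idea of cross-modal spatio-temporal attention concrete, here is a minimal sketch of one generic way an audio feature could attend over visual features from several frames at once. It is not the authors' architecture: the function name `space_time_attention`, the scaled dot-product scoring, and the nested-list feature layout `[T][H][W][C]` are all assumptions made for illustration.

```python
import math

def space_time_attention(visual_feats, audio_query):
    """Hypothetical helper: softmax attention of one audio vector over
    every spatial cell of every frame.

    visual_feats: nested list indexed [frame][row][col][channel]
    audio_query:  list of floats with the same channel count
    Returns a nested list [frame][row][col] of weights that sum to 1.
    """
    scale = math.sqrt(len(audio_query))
    # Scaled dot product between the audio query and each visual cell.
    scores = [[[sum(v * a for v, a in zip(cell, audio_query)) / scale
                for cell in row] for row in frame] for frame in visual_feats]
    # Subtract the maximum score for a numerically stable softmax.
    peak = max(s for frame in scores for row in frame for s in row)
    exps = [[[math.exp(s - peak) for s in row] for row in frame]
            for frame in scores]
    # Normalize jointly over time and space, not per frame.
    total = sum(e for frame in exps for row in frame for e in row)
    return [[[e / total for e in row] for row in frame] for frame in exps]

# Toy input (made-up values): 2 frames, a 2x2 grid, 2 channels per cell.
visual = [
    [[[0.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]]],
    [[[0.0, 0.0], [3.0, 3.0]], [[0.0, 0.0], [0.0, 0.0]]],
]
audio = [1.0, 1.0]
attn = space_time_attention(visual, audio)
```

Because the softmax is taken over all frames and positions together, a cell that matches the audio in one frame competes with cells in every other frame, which is the sense in which the resulting map is spatio-temporal rather than per-frame.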
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Hearing Loss and Rehabilitation
Methods: Memory Network
