Hear The Flow: Optical Flow-Based Self-Supervised Visual Sound Source Localization
Dennis Fedorishin, Deen Dayal Mohan, Bhavin Jawade, Srirangaraj Setlur, Venu Govindaraju

TL;DR
This paper introduces a novel optical flow-based self-supervised method for localizing sound sources in videos, leveraging motion information to improve accuracy and achieve state-of-the-art results on standard datasets.
Contribution
It proposes using optical flow as a prior for sound source localization, significantly enhancing attention maps and localization performance without explicit annotations.
Findings
Achieves state-of-the-art results on the Soundnet Flickr dataset.
Demonstrates that flow-based attention improves localization accuracy.
Also achieves state-of-the-art performance on the VGG Sound Source dataset.
Abstract
Learning to localize the sound source in videos without explicit annotations is a novel area of audio-visual research. Existing work in this area focuses on creating attention maps to capture the correlation between the two modalities to localize the source of the sound. In a video, oftentimes, the objects exhibiting movement are the ones generating the sound. In this work, we capture this characteristic by modeling the optical flow in a video as a prior to better aid in localizing the sound source. We further demonstrate that the addition of flow-based attention substantially improves visual sound source localization. Finally, we benchmark our method on standard sound source localization datasets and achieve state-of-the-art performance on the Soundnet Flickr and VGG Sound Source datasets. Code: https://github.com/denfed/heartheflow.
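The intuition in the abstract, that moving regions are more likely to be the sound source, can be illustrated with a small hand-crafted sketch. This is not the paper's model, which learns its flow-based attention; it only shows how a motion prior could reweight an audio-visual similarity map. The function name `localization_map`, the array shapes, and the `1 + prior` weighting are all assumptions made for illustration.

```python
import numpy as np

def localization_map(visual_feats, audio_emb, flow, eps=1e-8):
    """Illustrative only: combine audio-visual similarity with a motion prior.

    visual_feats: (H, W, C) per-location visual features (hypothetical input)
    audio_emb:    (C,) audio embedding in the same feature space
    flow:         (H, W, 2) optical flow vectors (dx, dy) per location
    Returns an (H, W) map that sums to 1.
    """
    # Cosine similarity between each visual location and the audio embedding.
    v = visual_feats / (np.linalg.norm(visual_feats, axis=-1, keepdims=True) + eps)
    a = audio_emb / (np.linalg.norm(audio_emb) + eps)
    sim = v @ a  # shape (H, W)

    # Spatial softmax turns similarities into an attention map.
    att = np.exp(sim - sim.max())
    att = att / att.sum()

    # Motion prior: flow magnitude scaled to [0, 1].
    mag = np.linalg.norm(flow, axis=-1)
    prior = mag / (mag.max() + eps)

    # Assumed weighting: boost attention where motion is present,
    # without zeroing out static regions.
    combined = att * (1.0 + prior)
    return combined / combined.sum()
```

With two locations that match the audio equally well, the one that also shows motion receives the larger weight, which is the behavior the abstract motivates.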
Videos
Hear The Flow: Optical Flow-Based Self-Supervised Visual Sound Source Localization · YouTube
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Hearing Loss and Rehabilitation
Methods: Max Pooling · Dense Connections · Dropout · Convolution · Softmax
