Separating the "Chirp" from the "Chat": Self-supervised Visual Grounding of Sound and Language
Mark Hamilton, Andrew Zisserman, John R. Hershey, William T. Freeman

TL;DR
DenseAV introduces a self-supervised dual encoder architecture that learns to localize and associate sounds and words in videos without explicit supervision, outperforming prior methods on semantic segmentation and cross-modal retrieval.
Contribution
The paper proposes DenseAV, a dual encoder architecture whose multi-head feature aggregation operator enables unsupervised audio-visual grounding and localization.
Findings
DenseAV outperforms prior art on speech and sound prompted semantic segmentation.
DenseAV surpasses ImageBind in cross-modal retrieval with fewer parameters.
The model automatically discovers sound and word associations without supervision.
Abstract
We present DenseAV, a novel dual encoder grounding architecture that learns high-resolution, semantically meaningful, and audio-visually aligned features solely through watching videos. We show that DenseAV can discover the "meaning" of words and the "location" of sounds without explicit localization supervision. Furthermore, it automatically discovers and distinguishes between these two types of associations without supervision. We show that DenseAV's localization abilities arise from a new multi-head feature aggregation operator that directly compares dense image and audio representations for contrastive learning. In contrast, many other systems that learn "global" audio and video representations cannot localize words and sound. Finally, we contribute two new datasets to improve the evaluation of AV representations through speech and sound prompted semantic segmentation. On…
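The abstract does not give the formula for the multi-head aggregation operator, so the following is only a minimal NumPy sketch of one plausible form, not the authors' implementation. It assumes dense audio features of shape (heads, channels, time) and dense visual features of shape (heads, channels, height, width), compares every audio position with every image position, and pools the resulting similarity volume into one score per audio-video pair. The pooling order (max over space, mean over time, sum over heads), the temperature, and all function names are assumptions made for illustration.

```python
import numpy as np

def similarity_volume(audio, visual):
    """Dense inner products between one clip's audio and visual features.

    audio:  (K, C, T)     K heads, C channels, T audio time steps (assumed layout)
    visual: (K, C, H, W)  K heads, C channels, H x W spatial grid (assumed layout)
    returns (K, T, H, W)
    """
    return np.einsum("kct,kchw->kthw", audio, visual)

def aggregate(volume):
    """Pool a (K, T, H, W) volume to a scalar score.

    Assumed pooling: max over space, mean over time, sum over heads.
    """
    per_head = volume.max(axis=(2, 3)).mean(axis=1)  # shape (K,)
    return per_head.sum()

def pairwise_scores(audio_batch, visual_batch):
    """Score every audio clip against every image in a batch.

    audio_batch:  (B, K, C, T)
    visual_batch: (B, K, C, H, W)
    returns (B, B), rows index audio, columns index images
    """
    vol = np.einsum("akct,vkchw->avkthw", audio_batch, visual_batch)
    return vol.max(axis=(4, 5)).mean(axis=3).sum(axis=2)

def info_nce(scores, temperature=0.1):
    """Symmetric InfoNCE loss: matching pairs lie on the diagonal."""
    logits = scores / temperature

    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))

# Toy usage with random features standing in for encoder outputs.
rng = np.random.default_rng(0)
audio = rng.normal(size=(4, 2, 8, 10))     # B=4, K=2, C=8, T=10
visual = rng.normal(size=(4, 2, 8, 7, 7))  # B=4, K=2, C=8, H=W=7
scores = pairwise_scores(audio, visual)
loss = info_nce(scores)
print(scores.shape, float(loss))
```

Because the score is built from a dense similarity volume rather than from two pooled "global" vectors, the same volume can be read out as a per-head heatmap over the image (for example by averaging over time), which is the kind of localization signal the abstract attributes to the operator.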
Taxonomy
Topics: Video Analysis and Summarization
