Separating the "Chirp" from the "Chat": Self-supervised Visual Grounding of Sound and Language

Mark Hamilton; Andrew Zisserman; John R. Hershey; William T. Freeman

arXiv:2406.05629 · cs.CV · June 11, 2024 · 1 citation

TL;DR

DenseAV introduces a self-supervised dual encoder architecture that learns to localize and associate sounds and words in videos without explicit supervision, outperforming prior methods on semantic segmentation and cross-modal retrieval.

Contribution

The paper proposes DenseAV, a novel architecture with a multi-head feature aggregation operator for unsupervised audio-visual grounding and localization.

Findings

1. DenseAV outperforms prior art on semantic segmentation tasks.
2. DenseAV surpasses ImageBind in cross-modal retrieval with fewer parameters.
3. The model automatically discovers sound and word associations without supervision.

Abstract

We present DenseAV, a novel dual encoder grounding architecture that learns high-resolution, semantically meaningful, and audio-visually aligned features solely through watching videos. We show that DenseAV can discover the "meaning" of words and the "location" of sounds without explicit localization supervision. Furthermore, it automatically discovers and distinguishes between these two types of associations without supervision. We show that DenseAV's localization abilities arise from a new multi-head feature aggregation operator that directly compares dense image and audio representations for contrastive learning. In contrast, many other systems that learn "global" audio and video representations cannot localize words and sound. Finally, we contribute two new datasets to improve the evaluation of AV representations through speech and sound prompted semantic segmentation. On…
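
The idea of comparing dense image and audio features for contrastive learning can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, tensor shapes, the pooling choice (max over image locations, mean over audio time steps, sum over heads), and the temperature are all assumptions made for illustration, since the abstract does not specify them.

```python
import numpy as np

def aggregate_similarity(audio_feats, image_feats):
    """Collapse dense, multi-head features into one audio-image score.

    audio_feats: (K, C, T)    K heads, C channels per head, T time steps
    image_feats: (K, C, H, W) K heads, C channels per head, H x W locations
    """
    # Dense similarity volume: per-head inner product between every audio
    # time step and every image location -> shape (K, T, H, W).
    sim = np.einsum("kct,kchw->kthw", audio_feats, image_feats)
    # Assumed pooling (hypothetical): best-matching location per time step,
    # averaged over time, then summed over heads.
    per_head = sim.max(axis=(2, 3)).mean(axis=1)  # shape (K,)
    return float(per_head.sum())

def infonce_loss(scores, temperature=0.07):
    """Audio-to-image InfoNCE over a (B, B) score matrix.

    scores[i, j] is the aggregated similarity between audio clip i and
    image j; matched pairs sit on the diagonal. The temperature value is
    an assumed default, not taken from the paper.
    """
    logits = scores / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))
```

In use, one would fill a B x B matrix by calling `aggregate_similarity` on every audio-image pair in a batch and pass it to `infonce_loss`. Because the score is built from a per-location similarity volume rather than a single pooled embedding, the intermediate `sim` array can be inspected to see which image regions respond to which audio time steps.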

Code & Models

Repositories

mhamilton723/DenseAV (PyTorch, official)

Taxonomy

Topics: Video Analysis and Summarization