# Self-Supervised Audio-Visual Co-Segmentation

**Authors:** Andrew Rouditchenko, Hang Zhao, Chuang Gan, Josh McDermott, Antonio Torralba

arXiv: 1904.09013 · 2019-04-22

## TL;DR

This paper introduces a self-supervised neural network model that learns from natural videos to segment objects in images and separate sound sources in audio. By disentangling semantic concepts in the network, it outperforms several baselines on both tasks.

## Contribution

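The paper presents a learning approach that disentangles semantic concepts in the network and assigns semantic categories to feature channels, so that image segmentation and sound source separation can each run independently after audio-visual training on unlabeled videos.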

## Key findings

- Outperforms several baselines in semantic segmentation
- Outperforms several baselines in sound source separation
- Disentangles concepts so that individual network feature channels correspond to semantic categories

## Abstract

Segmenting objects in images and separating sound sources in audio are challenging tasks, in part because traditional approaches require large amounts of labeled data. In this paper we develop a neural network model for visual object segmentation and sound source separation that learns from natural videos through self-supervision. The model is an extension of recently proposed work that maps image pixels to sounds. Here, we introduce a learning approach to disentangle concepts in the neural networks, and assign semantic categories to network feature channels to enable independent image segmentation and sound source separation after audio-visual training on videos. Our evaluations show that the disentangled model outperforms several baselines in semantic segmentation and sound source separation.
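As a rough illustration of what assigning semantic categories to feature channels makes possible, the toy sketch below turns a stack of per-channel activation maps into a label map. It is not the authors' implementation: the function name `segment_from_channels`, the `channel_to_category` mapping, and the `threshold` value are all hypothetical, and the sketch assumes each channel has already been tied to one category.

```python
import numpy as np

def segment_from_channels(features, channel_to_category, threshold=0.5):
    """Toy segmentation from category-aligned feature channels.

    features: array of shape (K, H, W), one activation map per channel
              (assumed here to be non-negative, e.g. after a sigmoid).
    channel_to_category: length-K sequence of integer category ids
              (hypothetical mapping, assumed to be known after training).
    threshold: pixels whose strongest activation falls below this value
              are labeled -1 (background). The value is an assumption.
    Returns an (H, W) integer label map.
    """
    features = np.asarray(features)
    categories = np.asarray(channel_to_category)
    best_channel = features.argmax(axis=0)   # strongest channel per pixel
    best_value = features.max(axis=0)        # its activation
    labels = categories[best_channel]        # channel index -> category id
    return np.where(best_value >= threshold, labels, -1)

# Two channels mapped to category ids 7 and 3 on a 2x2 image.
features = np.zeros((2, 2, 2))
features[0, 0, 0] = 0.9   # channel 0 fires at pixel (0, 0)
features[1, 1, 1] = 0.8   # channel 1 fires at pixel (1, 1)
labels = segment_from_channels(features, channel_to_category=[7, 3])
```

Because every channel stands for one category in this sketch, the image branch can be read out on its own, with no audio input, which is the kind of independent use the abstract describes.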

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1904.09013/full.md

## Figures

4 figures with captions in the complete paper: https://tomesphere.com/paper/1904.09013/full.md

## References

28 references — full list in the complete paper: https://tomesphere.com/paper/1904.09013/full.md

---
Source: https://tomesphere.com/paper/1904.09013