Improving Visual Recognition using Ambient Sound for Supervision

Rohan Mahadev; Hongyu Lu

arXiv:1912.11659·cs.CV·December 30, 2019

TL;DR

This paper explores using ambient sound as a supervisory signal to improve visual recognition in videos, proposing enhancements to existing methods and demonstrating better performance in downstream tasks.

Contribution

The paper reproduces Owens et al.'s approach and extends it by improving the sound-based pretext task, with the aim of better visual recognition performance.

Findings

01 Enhanced visual recognition accuracy with sound supervision

02 Improved pretext task leads to better downstream results

03 Reproduction of prior experiments with added methodological improvements

Abstract

Our brains combine vision and hearing to create a more elaborate interpretation of the world. When the visual input is insufficient, a rich panoply of sounds can be used to describe our surroundings. Since more than 1,000 hours of video are uploaded to the internet every day, it is arduous, if not impossible, to annotate these videos manually. Therefore, incorporating audio along with visual data without annotations is crucial for leveraging this explosion of data for recognizing and understanding objects and scenes. Owens et al. suggest that a rich representation of the physical world can be learned by using a convolutional neural network to predict sound textures associated with a given video frame. We attempt to reproduce the claims from their experiments, for which the code is not publicly available. In addition, we propose improvements in the pretext task that result in better…
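The abstract says a network is trained to predict sound textures from a video frame, but this page does not say how the sound is turned into a training target. One plausible way, assumed here purely for illustration, is to cluster per-clip sound feature vectors and use each clip's cluster index as a pseudo-label for its frame. The sketch below runs a small k-means with farthest-first initialization on toy 2-D vectors; the function names, the toy data, and the choice of k-means are all assumptions, not details taken from the paper.

```python
from typing import List, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_pseudo_labels(
    features: Sequence[Sequence[float]],
    num_clusters: int,
    iterations: int = 10,
) -> List[int]:
    """Cluster feature vectors and return one cluster index per vector.

    Hypothetical helper: the indices stand in for the classification
    targets a frame-level network could be trained to predict.
    """
    # Farthest-first initialization: start from the first vector, then
    # repeatedly add the vector farthest from all centroids chosen so far.
    centroids = [list(features[0])]
    while len(centroids) < num_clusters:
        farthest = max(
            features,
            key=lambda v: min(squared_distance(v, c) for c in centroids),
        )
        centroids.append(list(farthest))

    labels = [0] * len(features)
    for _ in range(iterations):
        # Assignment step: each vector joins its nearest centroid.
        labels = [
            min(range(num_clusters), key=lambda k: squared_distance(v, centroids[k]))
            for v in features
        ]
        # Update step: move each centroid to the mean of its members.
        for k in range(num_clusters):
            members = [v for v, label in zip(features, labels) if label == k]
            if members:
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Toy stand-ins for sound feature vectors: two well-separated groups.
toy_sound_features = [[0.0, 0.1], [0.1, 0.0], [5.0, 5.1], [5.1, 5.0]]
labels = kmeans_pseudo_labels(toy_sound_features, num_clusters=2)
print(labels)
```

Under this assumption, each video frame would inherit the pseudo-label of its clip's audio, and an image classifier could be trained on those labels with an ordinary classification loss, with no manual annotation involved.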

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Video Analysis and Summarization