Improving Visual Recognition using Ambient Sound for Supervision
Rohan Mahadev, Hongyu Lu

TL;DR
This paper explores using ambient sound as a supervisory signal to improve visual recognition in videos, proposing enhancements to existing methods and demonstrating better performance in downstream tasks.
Contribution
It reproduces and extends Owens et al.'s approach by improving the sound-based pretext task for better visual recognition performance.
Findings
Enhanced visual recognition accuracy with sound supervision
Improved pretext task leads to better downstream results
Reproduction of prior experiments with added methodological improvements
Abstract
Our brains combine vision and hearing to create a more elaborate interpretation of the world. When the visual input is insufficient, a rich panoply of sounds can be used to describe our surroundings. Since more than 1,000 hours of video are uploaded to the internet every day, it is arduous, if not impossible, to manually annotate these videos. Therefore, incorporating audio along with visual data without annotations is crucial for leveraging this explosion of data for recognizing and understanding objects and scenes. Owens et al. suggest that a rich representation of the physical world can be learned by using a convolutional neural network to predict sound textures associated with a given video frame. We attempt to reproduce the claims from their experiments, for which the code is not publicly available. In addition, we propose improvements in the pretext task that result in better…
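To make the idea of sound-based supervision concrete, the following is a minimal sketch of how audio alone could produce training targets for a frame classifier. It is not the authors' or Owens et al.'s pipeline: the per-band log-energy statistics are a simplified stand-in for a sound texture, the k-means clustering into pseudo-labels is an assumed way of turning the pretext task into classification, and the names `band_statistics` and `kmeans_labels` and all parameter values are hypothetical.

```python
import numpy as np

def band_statistics(waveform, n_fft=256, hop=128, n_bands=8):
    """Summarize a mono waveform as per-band mean and std of log energy.

    A simplified stand-in for a sound-texture descriptor (assumption).
    """
    window = np.hanning(n_fft)
    n_frames = 1 + (len(waveform) - n_fft) // hop
    frames = np.stack([waveform[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2  # (n_frames, n_fft // 2 + 1)
    # Group frequency bins into coarse bands and take log energy per band.
    bands = np.array_split(power, n_bands, axis=1)
    log_energy = np.stack([np.log(b.sum(axis=1) + 1e-8) for b in bands], axis=1)
    return np.concatenate([log_energy.mean(axis=0), log_energy.std(axis=0)])

def kmeans_labels(features, k, n_iter=20, seed=0):
    """Plain k-means; the cluster index of each clip serves as its pseudo-label."""
    rng = np.random.default_rng(seed)
    centers = features[rng.choice(len(features), size=k, replace=False)]
    for _ in range(n_iter):
        dists = ((features[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = features[labels == j]
            if len(members) > 0:  # keep the old center if a cluster empties
                centers[j] = members.mean(axis=0)
    return labels, centers

# Synthetic audio clips stand in for the soundtracks of unlabeled videos.
rng = np.random.default_rng(1)
clips = [rng.standard_normal(4000) for _ in range(10)]
features = np.stack([band_statistics(c) for c in clips])
pseudo_labels, _ = kmeans_labels(features, k=3)
```

In a setup like this, a convolutional network would then be trained to predict each clip's pseudo-label from one of its video frames, so that no manual annotation is needed.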
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Video Analysis and Summarization
