Learning Visual Styles from Audio-Visual Associations
Tingle Li, Yichen Liu, Andrew Owens, Hang Zhao

TL;DR
This paper introduces a novel method for learning visual styles from unlabeled audio-visual data, enabling manipulation of image textures based on sound cues, outperforming label-based methods in various evaluations.
Contribution
It presents a new approach for audio-driven image stylization that learns to modify images to match sounds without requiring labeled data.
Findings
Sound-based model outperforms label-based approaches
Audio manipulation leads to predictable visual style changes
Method effectively learns visual textures from audio-visual pairs
Abstract
From the patter of rain to the crunch of snow, the sounds we hear often convey the visual textures that appear within a scene. In this paper, we present a method for learning visual styles from unlabeled audio-visual data. Our model learns to manipulate the texture of a scene to match a sound, a problem we term audio-driven image stylization. Given a dataset of paired audio-visual data, we learn to modify input images such that, after manipulation, they are more likely to co-occur with a given input sound. In quantitative and qualitative evaluations, our sound-based model outperforms label-based approaches. We also show that audio can be an intuitive representation for manipulating images, as adjusting a sound's volume or mixing two sounds together results in predictable changes to visual style. Project webpage: https://tinglok.netlify.app/files/avstyle
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMusic and Audio Processing · Music Technology and Sound Studies · Animal Vocal Communication and Behavior
