Joint Object-Material Category Segmentation from Audio-Visual Cues
Anurag Arnab, Michael Sapienza, Stuart Golodetz, Julien Valentin, Ondrej Miksik, Shahram Izadi, Philip Torr

TL;DR
This paper introduces a joint audio-visual approach for dense object and material segmentation, leveraging sparse auditory cues alongside visual data to improve accuracy in scene understanding.
Contribution
It proposes a novel multi-output labeling framework that combines visual and auditory cues using a random-field model for enhanced scene analysis.
Findings
Joint audio-visual cues improve segmentation accuracy
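The method outperforms visual-only approaches, and joint estimation of object and material labels significantly outperforms estimating either category in isolation
A new, publicly available dataset with paired visual and auditory data is introduced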
Abstract
It is not always possible to recognise objects and infer material properties for a scene from visual cues alone, since objects can look visually similar whilst being made of very different materials. In this paper, we therefore present an approach that augments the available dense visual cues with sparse auditory cues in order to estimate dense object and material labels. Since estimates of object class and material properties are mutually informative, we optimise our multi-output labelling jointly using a random-field framework. We evaluate our system on a new dataset with paired visual and auditory data that we make publicly available. We demonstrate that this joint estimation of object and material labels significantly outperforms the estimation of either category in isolation.
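The idea of jointly optimising two mutually informative label sets can be illustrated with a deliberately small sketch. This is not the paper's model: it assumes hand-made per-site costs, a hypothetical object-material compatibility table, and no spatial pairwise terms, so each site is solved independently by brute force. The names `joint_labels`, `obj_cost`, `mat_cost`, `compat`, and `audio_cost` are illustrative assumptions.

```python
from itertools import product


def joint_labels(obj_cost, mat_cost, compat, audio_cost=None):
    """Pick the (object, material) pair with the lowest total cost at each site.

    obj_cost[s][o]  : visual cost of object label o at site s (assumed input)
    mat_cost[s][m]  : visual cost of material label m at site s (assumed input)
    compat[o][m]    : penalty for pairing object o with material m (assumed input)
    audio_cost[s][m]: extra material cost at the few sites where a sound was recorded
    """
    audio_cost = audio_cost or {}
    labels = []
    for site, (o_costs, m_costs) in enumerate(zip(obj_cost, mat_cost)):
        # Sites without a recorded sound get no auditory evidence.
        extra = audio_cost.get(site, [0.0] * len(m_costs))
        best = min(
            product(range(len(o_costs)), range(len(m_costs))),
            key=lambda om: o_costs[om[0]]
            + m_costs[om[1]]
            + extra[om[1]]
            + compat[om[0]][om[1]],
        )
        labels.append(best)
    return labels


# Two sites, two object labels, two material labels (all values invented).
obj_cost = [[0.1, 0.9], [0.1, 0.9]]
mat_cost = [[0.6, 0.4], [0.6, 0.4]]  # vision alone slightly prefers material 1
compat = [[0.0, 0.0], [1.0, 0.0]]    # object 1 is unlikely to be made of material 0
audio = {0: [0.0, 1.0]}              # a sound at site 0 favours material 0

visual_only = joint_labels(obj_cost, mat_cost, compat)
with_audio = joint_labels(obj_cost, mat_cost, compat, audio)
print(visual_only)
print(with_audio)
```

In this toy setting the auditory evidence changes the material label only at the site where the sound was recorded. A full random-field formulation would also include pairwise terms between neighbouring sites, which is how a sparse cue could influence a dense labelling; those terms are omitted here for brevity.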
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Advanced Image and Video Retrieval Techniques
