Segmenting Collision Sound Sources in Egocentric Videos
Kranti Kumar Parida, Omar Emara, Hazel Doughty, Dima Damen

TL;DR
This paper introduces Collision Sound Source Segmentation (CS3), a new task of segmenting the objects responsible for a collision sound in egocentric video, conditioned on the audio. The proposed weakly-supervised method builds on foundation models (CLIP and SAM2) and outperforms baselines by 3x and 4.7x in mIoU on two benchmarks.
Contribution
The paper proposes the first weakly-supervised method for audio-conditioned segmentation of collision sound sources in egocentric videos, leveraging foundation models and egocentric cues.
Findings
Outperforms baselines by 3x and 4.7x in mIoU (mean intersection over union) on two benchmarks
Introduces EPIC-CS3 and Ego4D-CS3 datasets for the CS3 task
Demonstrates effectiveness of foundation models and egocentric cues in collision sound segmentation
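The mIoU figure in the findings above is a mask-overlap metric. As a minimal sketch of how such a score is typically computed for binary segmentation masks, the Python below averages per-sample intersection over union. The function names, the nested-list mask representation, and the convention for two empty masks are assumptions made for illustration; the paper's exact evaluation protocol (for example, how frames or object pairs are averaged) may differ.

```python
from typing import Sequence

# Hypothetical representation: a binary mask as rows of 0/1 values.
Mask = Sequence[Sequence[int]]


def iou(pred: Mask, target: Mask) -> float:
    """Intersection over union of two binary masks of equal shape."""
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                inter += 1  # pixel is on in both masks
            if p or t:
                union += 1  # pixel is on in at least one mask
    # Assumed convention: two empty masks count as a perfect match.
    return inter / union if union else 1.0


def mean_iou(preds: Sequence[Mask], targets: Sequence[Mask]) -> float:
    """Average the per-sample IoU over a set of predictions."""
    if not preds or len(preds) != len(targets):
        raise ValueError("preds and targets must be non-empty and equal length")
    return sum(iou(p, t) for p, t in zip(preds, targets)) / len(preds)


# Two toy 2x2 samples: the first overlaps on 1 of 2 pixels, the second exactly.
preds = [[[1, 1], [0, 0]], [[0, 1], [0, 1]]]
targets = [[[1, 0], [0, 0]], [[0, 1], [0, 1]]]
score = mean_iou(preds, targets)
```

Under this definition, a 3x improvement means the average overlap between predicted and ground-truth masks is three times that of the baseline, not that three times as many objects are found.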
Abstract
Humans excel at multisensory perception and can often recognise object properties from the sound of their interactions. Inspired by this, we propose the novel task of Collision Sound Source Segmentation (CS3), where we aim to segment the objects responsible for a collision sound in visual input (i.e. video frames from the collision clip), conditioned on the audio. This task presents unique challenges. Unlike isolated sound events, a collision sound arises from interactions between two objects, and the acoustic signature of the collision depends on both. We focus on egocentric video, where sounds are often clear, but the visual scene is cluttered, objects are small, and interactions are brief. To address these challenges, we propose a weakly-supervised method for audio-conditioned segmentation, utilising foundation models (CLIP and SAM2). We also incorporate egocentric cues, i.e.…
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Multisensory perception and integration
