Segmenting Collision Sound Sources in Egocentric Videos

Kranti Kumar Parida; Omar Emara; Hazel Doughty; Dima Damen

arXiv:2511.13863·cs.CV·November 21, 2025

Open Access

TL;DR

This paper introduces Collision Sound Source Segmentation (CS3), a new task of segmenting the objects responsible for a collision sound in egocentric video. The proposed weakly-supervised method builds on foundation models and egocentric cues, and outperforms baselines by 3x and 4.7x in mIoU on two benchmarks.

Contribution

The paper proposes the first weakly-supervised method for audio-conditioned segmentation of collision sound sources in egocentric videos, leveraging foundation models and egocentric cues.

Findings

01. Outperforms baselines by 3x and 4.7x in mIoU on two benchmarks

02. Introduces EPIC-CS3 and Ego4D-CS3 datasets for the CS3 task

03. Demonstrates effectiveness of foundation models and egocentric cues in collision sound segmentation
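The findings above are reported in mIoU (mean intersection-over-union). As a rough illustration of what that metric measures, here is a minimal sketch of one common definition: the IoU of each predicted binary mask against its ground-truth mask, averaged over samples. The function names, the nested-list mask representation, and the handling of empty masks are assumptions made for this example; the paper's exact evaluation protocol may differ.

```python
from typing import Sequence

# Hypothetical representation: a 2D binary mask as nested sequences, 1 = object pixel.
Mask = Sequence[Sequence[int]]


def iou(pred: Mask, target: Mask) -> float:
    """Intersection-over-union of two binary masks of equal shape."""
    intersection = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                intersection += 1
            if p or t:
                union += 1
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if union == 0 else intersection / union


def mean_iou(preds: Sequence[Mask], targets: Sequence[Mask]) -> float:
    """Average per-sample IoU over a set of predictions."""
    if len(preds) != len(targets) or len(preds) == 0:
        raise ValueError("preds and targets must be non-empty and the same length")
    return sum(iou(p, t) for p, t in zip(preds, targets)) / len(preds)


# Toy usage: one half-correct prediction and one exact prediction.
preds = [[[1, 1], [0, 0]], [[0, 1], [0, 1]]]
targets = [[[1, 0], [0, 0]], [[0, 1], [0, 1]]]
score = mean_iou(preds, targets)
```

Under this definition, a "3x" improvement means the method's mIoU is three times the baseline's on the same benchmark.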

Abstract

Humans excel at multisensory perception and can often recognise object properties from the sound of their interactions. Inspired by this, we propose the novel task of Collision Sound Source Segmentation (CS3), where we aim to segment the objects responsible for a collision sound in visual input (i.e. video frames from the collision clip), conditioned on the audio. This task presents unique challenges. Unlike isolated sound events, a collision sound arises from interactions between two objects, and the acoustic signature of the collision depends on both. We focus on egocentric video, where sounds are often clear, but the visual scene is cluttered, objects are small, and interactions are brief. To address these challenges, we propose a weakly-supervised method for audio-conditioned segmentation, utilising foundation models (CLIP and SAM2). We also incorporate egocentric cues, i.e.…

Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing · Multisensory perception and integration