Collaborative Hybrid Propagator for Temporal Misalignment in Audio-Visual Segmentation
Kexin Li, Zongxin Yang, Yi Yang, Jun Xiao

TL;DR
This paper introduces a framework for audio-visual video segmentation that addresses temporal misalignment by identifying the points where the audio's semantics change and propagating segmentation results frame by frame in line with those boundaries.
Contribution
The proposed Collaborative Hybrid Propagator Framework (Co-Prop) combines audio boundary detection with frame-by-frame propagation to improve temporal alignment in audio-visual video segmentation (AVVS).
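The combination of boundary detection and frame-by-frame propagation can be illustrated with a toy sketch. It is not the authors' implementation: it assumes a per-frame audio label is already available for each frame, and the helper names (`find_change_points`, `split_into_segments`, `propagate`, `init_mask`, `step`) are hypothetical. A mask is re-initialised at every audio boundary and carried forward only inside the segment it belongs to.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

Mask = TypeVar("Mask")


def find_change_points(labels: Sequence[str]) -> List[int]:
    """Indices of frames whose audio label differs from the previous frame."""
    return [i for i in range(1, len(labels)) if labels[i] != labels[i - 1]]


def split_into_segments(
    num_frames: int, change_points: Sequence[int]
) -> List[Tuple[int, int]]:
    """Split [0, num_frames) into half-open segments bounded by change points."""
    bounds = [0, *change_points, num_frames]
    return [(s, e) for s, e in zip(bounds[:-1], bounds[1:]) if s < e]


def propagate(
    segments: Sequence[Tuple[int, int]],
    init_mask: Callable[[int], Mask],
    step: Callable[[Mask, int], Mask],
) -> List[Mask]:
    """Produce one mask per frame, restarting at the start of each segment."""
    masks: List[Mask] = []
    for start, end in segments:
        mask = init_mask(start)          # re-anchor at the audio boundary
        masks.append(mask)
        for frame in range(start + 1, end):
            mask = step(mask, frame)     # carry the mask forward one frame
            masks.append(mask)
    return masks


# Toy input: assumed per-frame audio labels (not taken from the paper).
labels = ["dog", "dog", "silence", "silence", "cat"]
change_points = find_change_points(labels)
segments = split_into_segments(len(labels), change_points)

# Stand-in "masks": each frame records the index of the frame that anchored it.
anchors = propagate(segments, init_mask=lambda f: f, step=lambda m, f: m)
```

In this toy run, `anchors` records which boundary frame each frame's result was propagated from, so no result crosses an audio change point. A real system would replace `init_mask` and `step` with a segmentation model and a mask propagator.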
Findings
Improves alignment accuracy across three datasets
Reduces memory usage compared to traditional methods
Can be integrated with existing AVVS approaches
Abstract
Audio-visual video segmentation (AVVS) aims to generate pixel-level maps of sound-producing objects that accurately align with the corresponding audio. However, existing methods often suffer from temporal misalignment, where audio cues and segmentation results are not temporally coordinated. Audio provides two critical pieces of information: i) target object-level details and ii) the timing of when objects start and stop producing sounds. Current methods focus more on object-level information but neglect the boundaries of audio semantic changes, leading to temporal misalignment. To address this issue, we propose a Collaborative Hybrid Propagator Framework (Co-Prop). This framework includes two main steps: Preliminary Audio Boundary Anchoring and Frame-by-Frame Audio-Insert Propagation. To anchor the audio boundary, we employ retrieval-assist prompts with Qwen large language models to identify…
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Music Technology and Sound Studies
Methods: ALIGN · Focus
