THU-Warwick Submission for EPIC-KITCHEN Challenge 2025: Semi-Supervised Video Object Segmentation
Mingqi Gao, Haoran Duan, Tianlu Zhang, Jungong Han

TL;DR
This paper presents a semi-supervised egocentric video object segmentation method that combines large-scale visual pretraining with depth cues, achieving high accuracy on the VISOR benchmark.
Contribution
The report introduces a framework that integrates SAM2 pretraining with depth-based geometric cues to improve long-term segmentation in egocentric videos.
Findings
J&F score of 90.1% on the VISOR test set
Effective handling of complex scenes and long-term tracking
Integration of visual pretraining with depth cues
Abstract
In this report, we describe our approach to egocentric video object segmentation. Our method combines large-scale visual pretraining from SAM2 with depth-based geometric cues to handle complex scenes and long-term tracking. By integrating these signals in a unified framework, we achieve strong segmentation performance. On the VISOR test set, our method reaches a J&F score of 90.1%.
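For readers unfamiliar with the J&F metric quoted above, the following is a minimal sketch of how such a score is commonly computed for a single frame, assuming the DAVIS-style definition: J is the intersection-over-union of the predicted and ground-truth masks, F is an F-measure over boundary pixels, and J&F is their mean. The function names, the 4-neighbour morphology, and the fixed pixel `tolerance` are illustrative assumptions, not the official VISOR evaluation code, whose boundary matching may differ in detail.

```python
import numpy as np

def region_similarity(pred, gt):
    """J: intersection-over-union of two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # both masks empty: treated here as perfect agreement
    return np.logical_and(pred, gt).sum() / union

def _erode(mask):
    """4-neighbour erosion; pixels outside the image count as background."""
    p = np.pad(mask, 1, constant_values=False)
    return p[1:-1, 1:-1] & p[:-2, 1:-1] & p[2:, 1:-1] & p[1:-1, :-2] & p[1:-1, 2:]

def _dilate(mask, iterations):
    """4-neighbour dilation applied `iterations` times."""
    out = mask.copy()
    for _ in range(iterations):
        p = np.pad(out, 1, constant_values=False)
        out = p[1:-1, 1:-1] | p[:-2, 1:-1] | p[2:, 1:-1] | p[1:-1, :-2] | p[1:-1, 2:]
    return out

def boundary(mask):
    """Boundary pixels: mask pixels removed by one erosion step."""
    mask = mask.astype(bool)
    return mask & ~_erode(mask)

def boundary_f_measure(pred, gt, tolerance=1):
    """F: harmonic mean of boundary precision and recall.

    A boundary pixel counts as matched if it lies within `tolerance`
    dilation steps of the other mask's boundary (a simplification).
    """
    pb, gb = boundary(pred), boundary(gt)
    if pb.sum() == 0 and gb.sum() == 0:
        return 1.0
    if pb.sum() == 0 or gb.sum() == 0:
        return 0.0
    precision = (pb & _dilate(gb, tolerance)).sum() / pb.sum()
    recall = (gb & _dilate(pb, tolerance)).sum() / gb.sum()
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def j_and_f(pred, gt, tolerance=1):
    """J&F: mean of region similarity and boundary F-measure."""
    return 0.5 * (region_similarity(pred, gt) + boundary_f_measure(pred, gt, tolerance))
```

A benchmark-level score would then be obtained by averaging these per-frame values over objects and sequences; the exact averaging protocol is defined by the benchmark and is not reproduced here.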
Taxonomy
Topics: Visual Attention and Saliency Detection · Video Surveillance and Tracking Methods · Advanced Neural Network Applications
