THU-Warwick Submission for EPIC-KITCHEN Challenge 2025: Semi-Supervised Video Object Segmentation

Mingqi Gao; Haoran Duan; Tianlu Zhang; Jungong Han

arXiv:2506.06748 · cs.CV · June 10, 2025

Open Access

TL;DR

This report presents a semi-supervised egocentric video object segmentation method that combines large-scale visual pretraining from SAM2 with depth-based geometric cues, reaching a J&F score of 90.1% on the VISOR test set.

Contribution

The report introduces a unified framework that integrates SAM2 pretraining with depth cues to improve long-term segmentation in egocentric videos.

Findings

01 J&F score of 90.1% on VISOR test set

02 Effective handling of complex scenes and long-term tracking

03 Integration of visual pretraining with depth cues

Abstract

In this report, we describe our approach to egocentric video object segmentation. Our method combines large-scale visual pretraining from SAM2 with depth-based geometric cues to handle complex scenes and long-term tracking. By integrating these signals in a unified framework, we achieve strong segmentation performance. On the VISOR test set, our method reaches a J&F score of 90.1%.
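The J&F score quoted above is conventionally the mean of region similarity J (mask intersection-over-union) and contour accuracy F (a boundary F-measure). The sketch below illustrates that definition on small binary masks. It is a simplified illustration, not the challenge's evaluation code: the function names, the 4-neighbour boundary rule, and the pixel tolerance `tol` are assumptions made for this example, and benchmark implementations typically scale the boundary tolerance with image size and average over objects and frames.

```python
from typing import List, Set, Tuple

Mask = List[List[int]]  # binary mask stored as rows of 0/1 (assumed format)


def region_similarity(pred: Mask, gt: Mask) -> float:
    """J: intersection-over-union of two binary masks of equal size."""
    inter = union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    # Two empty masks agree perfectly by convention.
    return 1.0 if union == 0 else inter / union


def boundary_pixels(mask: Mask) -> Set[Tuple[int, int]]:
    """Foreground pixels with a 4-neighbour that is background or off-image."""
    h, w = len(mask), len(mask[0])
    boundary = set()
    for y in range(h):
        for x in range(w):
            if not mask[y][x]:
                continue
            for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                ny, nx = y + dy, x + dx
                if not (0 <= ny < h and 0 <= nx < w) or not mask[ny][nx]:
                    boundary.add((y, x))
                    break
    return boundary


def contour_accuracy(pred: Mask, gt: Mask, tol: int = 1) -> float:
    """F: boundary F-measure; a boundary pixel matches if one lies within tol."""
    pb, gb = boundary_pixels(pred), boundary_pixels(gt)
    if not pb and not gb:
        return 1.0
    if not pb or not gb:
        return 0.0

    def near(p: Tuple[int, int], others: Set[Tuple[int, int]]) -> bool:
        # Chebyshev distance keeps the tolerance check simple.
        return any(max(abs(p[0] - q[0]), abs(p[1] - q[1])) <= tol for q in others)

    precision = sum(near(p, gb) for p in pb) / len(pb)
    recall = sum(near(g, pb) for g in gb) / len(gb)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def j_and_f(pred: Mask, gt: Mask, tol: int = 1) -> float:
    """J&F: arithmetic mean of region similarity and contour accuracy."""
    return 0.5 * (region_similarity(pred, gt) + contour_accuracy(pred, gt, tol))
```

For a single frame and object, `j_and_f(pred, gt)` returns a value in [0, 1]; a reported figure such as 90.1% corresponds to this kind of score averaged over a test set and expressed as a percentage.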

Taxonomy

Topics: Visual Attention and Saliency Detection · Video Surveillance and Tracking Methods · Advanced Neural Network Applications