Beyond Appearance: Geometric Cues for Robust Video Instance Segmentation

Quanzhu Niu; Yikang Zhou; Shihao Chen; Tao Zhang; Shunping Ji

arXiv:2507.05948 · cs.CV · July 15, 2025

TL;DR

This paper introduces geometric cues, specifically monocular depth estimation, to improve the robustness of Video Instance Segmentation against occlusions and motion blur, achieving state-of-the-art results.

Contribution

The paper systematically explores three ways of incorporating depth information into VIS and reports significant improvements with two of them, EDC and SV.

Findings

01

EDC and SV methods significantly improve VIS robustness

02

EDC achieves 56.2 AP with Swin-L backbone on OVIS benchmark

03

Depth cues are validated as critical for robust video understanding

Abstract

Video Instance Segmentation (VIS) fundamentally struggles with pervasive challenges including object occlusions, motion blur, and appearance variations during temporal association. To overcome these limitations, this work introduces geometric awareness to enhance VIS robustness by leveraging monocular depth estimation. We systematically investigate three distinct integration paradigms. The Expanding Depth Channel (EDC) method concatenates the depth map as an additional input channel to the segmentation network; Sharing ViT (SV) uses a single ViT backbone shared between the depth estimation and segmentation branches; Depth Supervision (DS) uses depth prediction as an auxiliary training signal for feature learning. Although DS shows limited effectiveness, benchmark evaluations demonstrate that EDC and SV significantly enhance the robustness of VIS. With a Swin-L backbone, our EDC…
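
The EDC idea described in the abstract, feeding depth to the segmentation network as an extra input channel, can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names `expand_depth_channel` and `expand_first_conv`, the min-max depth normalisation, and the zero-initialised depth weights are all assumptions made here for illustration.

```python
import numpy as np


def expand_depth_channel(rgb, depth):
    """Stack a depth map onto an RGB frame as a fourth channel.

    rgb:   (H, W, 3) array
    depth: (H, W) array, e.g. from a monocular depth estimator
    Returns an (H, W, 4) float32 array.
    """
    if rgb.shape[:2] != depth.shape:
        raise ValueError("rgb and depth must share height and width")
    rgb = rgb.astype(np.float32)
    depth = depth.astype(np.float32)
    # Assumption: min-max scale depth to [0, 1] so it is on a range
    # comparable to normalised RGB; the paper may scale differently.
    d_min, d_max = depth.min(), depth.max()
    span = d_max - d_min
    if span > 0:
        depth_norm = (depth - d_min) / span
    else:
        depth_norm = np.zeros_like(depth)
    return np.concatenate([rgb, depth_norm[..., None]], axis=-1)


def expand_first_conv(weight_rgb):
    """Extend first-layer conv weights from 3 to 4 input channels.

    weight_rgb: (out_channels, 3, kH, kW) pretrained RGB weights.
    Assumption: the new depth slice is zero-initialised, so the expanded
    layer initially produces the same output as the RGB-only layer.
    """
    out_c, in_c, k_h, k_w = weight_rgb.shape
    if in_c != 3:
        raise ValueError("expected weights with 3 input channels")
    depth_slice = np.zeros((out_c, 1, k_h, k_w), dtype=weight_rgb.dtype)
    return np.concatenate([weight_rgb, depth_slice], axis=1)
```

In this sketch a frame of shape (H, W, 3) and its depth map of shape (H, W) become one (H, W, 4) input, and a pretrained RGB stem with weights of shape (C, 3, k, k) becomes (C, 4, k, k). Zero-initialising the depth slice is one common way to keep pretrained behaviour intact at the start of fine-tuning; other initialisations are equally plausible.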

Code & Models

Datasets

QuanzhuNiu/OVIS_RGBD
dataset · 21k downloads

Taxonomy

Topics: Visual Attention and Saliency Detection · Advanced Vision and Imaging · Human Pose and Action Recognition