Hierarchical Visual Prompt Learning for Continual Video Instance Segmentation

Jiahua Dong; Hui Yin; Wenqi Liang; Hanbin Zhao; Henghui Ding; Nicu Sebe; Salman Khan; Fahad Shahbaz Khan

arXiv:2508.08612·cs.CV·August 13, 2025

Open Access

TL;DR

This paper introduces HVPL, a hierarchical visual prompt learning model that effectively mitigates catastrophic forgetting in continual video instance segmentation by leveraging frame-level and video-level prompts and context decoding.

Contribution

The paper proposes a novel hierarchical prompt learning framework with task-specific prompts and orthogonal gradient correction to address continual learning challenges in VIS.

Findings

01. HVPL outperforms baseline methods in continual VIS tasks.
02. The orthogonal gradient correction improves retention of old class knowledge.
03. Video context decoding enhances inter-class relationship modeling.

Abstract

Video instance segmentation (VIS) has gained significant attention for its ability to track and segment object instances across video frames. However, most existing VIS approaches unrealistically assume that the categories of object instances remain fixed over time. Moreover, they suffer catastrophic forgetting of old classes when required to continually learn object instances belonging to new categories. To resolve these challenges, we develop a novel Hierarchical Visual Prompt Learning (HVPL) model that overcomes catastrophic forgetting of previous categories from both frame-level and video-level perspectives. Specifically, to mitigate forgetting at the frame level, we devise a task-specific frame prompt and an orthogonal gradient correction (OGC) module. The OGC module helps the frame prompt encode task-specific global instance information for new classes in each…
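
The abstract names an orthogonal gradient correction step but, as excerpted here, does not say how it is computed. As a rough illustration of the general idea only, the sketch below shows a generic orthogonal gradient projection: a gradient for the new task is stripped of its component inside a subspace spanned by old-task features, so the update does not move the model along directions the old tasks rely on. This is not the authors' OGC module. The function names `old_task_basis` and `project_out`, the `energy` threshold, and the use of an SVD to build the subspace are all assumptions made for this example.

```python
import numpy as np


def old_task_basis(old_features: np.ndarray, energy: float = 0.99) -> np.ndarray:
    """Orthonormal basis (as columns) for the dominant subspace of old-task features.

    Hypothetical helper, not taken from the paper.
    old_features has shape (n_samples, dim) and must not be all zeros.
    """
    # Left singular vectors of the (dim, n_samples) matrix span feature space.
    u, s, _ = np.linalg.svd(old_features.T, full_matrices=False)
    # Keep the fewest directions whose cumulative squared singular values reach `energy`.
    ratio = np.cumsum(s ** 2) / np.sum(s ** 2)
    k = min(int(np.searchsorted(ratio, energy)) + 1, len(s))
    return u[:, :k]


def project_out(grad: np.ndarray, basis: np.ndarray) -> np.ndarray:
    """Remove from `grad` its component lying in the span of `basis`.

    Hypothetical helper: assumes `basis` has orthonormal columns.
    """
    return grad - basis @ (basis.T @ grad)


if __name__ == "__main__":
    # Toy old-task features that only occupy the first two of four dimensions.
    feats = np.array([
        [1.0, 0.0, 0.0, 0.0],
        [0.0, 1.0, 0.0, 0.0],
        [1.0, 1.0, 0.0, 0.0],
        [1.0, -1.0, 0.0, 0.0],
    ])
    basis = old_task_basis(feats)
    new_grad = np.array([1.0, 2.0, 3.0, 4.0])
    corrected = project_out(new_grad, basis)
    # The corrected gradient has no component along any old-task feature.
    print(np.allclose(feats @ corrected, 0.0))
```

In this toy setup the corrected gradient keeps only the last two coordinates of `new_grad`, because the old-task features live entirely in the first two. How HVPL actually defines the protected subspace, and where in the prompt-learning pipeline the correction is applied, is left to the full paper.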

Taxonomy

Topics: Visual Attention and Saliency Detection · Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications