Unified Embedding Alignment for Open-Vocabulary Video Instance Segmentation
Hao Fang, Peng Wu, Yawei Li, Xinxin Zhang, and Xiankai Lu

TL;DR
This paper introduces OVFormer, a novel approach for open-vocabulary video instance segmentation that aligns embeddings and leverages temporal consistency, significantly improving performance and generalization over previous methods.
Contribution
OVFormer is a new baseline that uses unified embedding alignment and video-based training to enhance open-vocabulary VIS performance and generalization.
Findings
Achieves 21.9 mAP on LV-VIS with a ResNet-50 backbone, surpassing the previous state of the art by 7.7 mAP.
Demonstrates strong zero-shot generalization on YouTube-VIS 2019 and OVIS datasets.
Uses a lightweight module to align query embeddings with CLIP image embeddings, and a semi-online inference scheme to exploit temporal consistency.
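To make the role of embedding alignment concrete, here is a minimal sketch of the generic CLIP-style open-vocabulary classification step: an instance query embedding is compared with text embeddings of the class names by cosine similarity, and a softmax turns the similarities into class probabilities. It assumes the query embedding already lives in the same space as the text embeddings; OVFormer's actual alignment module is not described on this page and is not reproduced here. The function names, the toy three-dimensional vectors, and the temperature value are all illustrative assumptions.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify_query(query_emb, text_embs, temperature=0.01):
    """Score one query embedding against every class-name text embedding.

    The temperature is an assumed value, not a setting taken from the paper.
    """
    q = l2_normalize(query_emb)
    logits = []
    for text_emb in text_embs:
        t = l2_normalize(text_emb)
        cosine = sum(a * b for a, b in zip(q, t))
        logits.append(cosine / temperature)
    return softmax(logits)


# Toy data: in practice the text embeddings would come from a CLIP text
# encoder and the query embedding from the segmentation model.
class_names = ["cat", "dog", "skateboard"]
text_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
query_emb = [0.2, 0.9, 0.1]

probs = classify_query(query_emb, text_embs)
predicted = class_names[probs.index(max(probs))]
print(predicted)
```

Because the class list is only a set of text embeddings, new categories can be added at inference time by appending entries to `text_embs`, which is what makes the vocabulary open. If the query embeddings and the CLIP embeddings are not in a shared space, the cosine scores are unreliable, which is the domain gap the summary refers to.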
Abstract
Open-Vocabulary Video Instance Segmentation (VIS) is attracting increasing attention due to its ability to segment and track arbitrary objects. However, recent Open-Vocabulary VIS attempts have produced unsatisfactory results, especially in generalizing to novel categories. We find two central causes: the domain gap between VLM features (e.g., CLIP) and instance queries, and the underutilization of temporal consistency. To mitigate these issues, we design and train a novel Open-Vocabulary VIS baseline called OVFormer. OVFormer uses a lightweight module for unified embedding alignment between query embeddings and CLIP image embeddings to remedy the domain gap. Unlike previous image-based training methods, we conduct video-based model training and deploy a semi-online inference scheme to fully mine the temporal consistency in the video. Without…
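The semi-online inference idea can be sketched as follows, under stated assumptions: the video is processed in short clips rather than frame by frame or all at once, and instances are linked across consecutive clips by matching their query embeddings. The matching rule used here (maximum total cosine similarity, found by brute force over permutations as a small stand-in for Hungarian matching) is an assumption for illustration, not a detail confirmed by this page. All function names and the toy embeddings are hypothetical.

```python
from itertools import permutations


def split_into_clips(num_frames, clip_len):
    """Partition frame indices into consecutive, non-overlapping clips."""
    return [
        list(range(start, min(start + clip_len, num_frames)))
        for start in range(0, num_frames, clip_len)
    ]


def cosine(a, b):
    """Cosine similarity of two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def match_queries(prev, curr):
    """Return order such that curr[order[i]] is matched to prev[i].

    Brute force over permutations; only practical for a handful of queries.
    Assumes both clips have the same number of queries.
    """
    n = len(prev)
    best = max(
        permutations(range(n)),
        key=lambda p: sum(cosine(prev[i], curr[p[i]]) for i in range(n)),
    )
    return list(best)


def track_across_clips(clip_queries):
    """Link per-clip query embeddings into one track per instance."""
    tracks = [[q] for q in clip_queries[0]]
    prev = clip_queries[0]
    for curr in clip_queries[1:]:
        order = match_queries(prev, curr)
        prev = [curr[j] for j in order]  # reorder to follow existing tracks
        for track, q in zip(tracks, prev):
            track.append(q)
    return tracks


# Toy example: two instances whose query order is swapped in the second clip.
clips = split_into_clips(num_frames=7, clip_len=3)
clip_queries = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.1, 0.9], [0.9, 0.1]],
]
tracks = track_across_clips(clip_queries)
print(clips)
print(tracks[0])
```

Processing clips instead of single frames lets the model see short-range motion inside each clip, while the cross-clip matching keeps instance identities stable over the whole video without loading every frame at once.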
Taxonomy
Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Human Pose and Action Recognition
Methods: Softmax · Attention Is All You Need · Contrastive Language-Image Pre-training
