Unified Embedding Alignment for Open-Vocabulary Video Instance Segmentation

Hao Fang, Peng Wu, Yawei Li, Xinxin Zhang, and Xiankai Lu

arXiv:2407.07427 · cs.CV · July 15, 2024

TL;DR

This paper introduces OVFormer, an approach to open-vocabulary video instance segmentation that aligns instance query embeddings with CLIP image embeddings and exploits temporal consistency through video-based training and semi-online inference, significantly improving performance and generalization over previous methods.

Contribution

OVFormer is a new baseline that uses unified embedding alignment and video-based training to enhance open-vocabulary VIS performance and generalization.

Findings

01. Achieves 21.9 mAP on LV-VIS with a ResNet-50 backbone, surpassing the previous state of the art by 7.7 mAP.

02. Demonstrates strong zero-shot generalization on the YouTube-VIS 2019 and OVIS datasets.

03. Uses a lightweight module for embedding alignment and a semi-online inference scheme for temporal consistency.
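
To make the semi-online idea more concrete, here is a minimal sketch in plain Python, not the authors' implementation. It assumes a video is processed in short clips and that instances in consecutive clips are linked by the similarity of their embeddings. The helper names (`split_into_clips`, `match_instances`), the greedy matching rule, and the toy embeddings are all hypothetical; the paper's actual association step may differ.

```python
import math


def split_into_clips(num_frames, clip_len):
    """Split frame indices 0..num_frames-1 into consecutive clips (assumed scheme)."""
    return [
        list(range(start, min(start + clip_len, num_frames)))
        for start in range(0, num_frames, clip_len)
    ]


def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def match_instances(prev_embeddings, curr_embeddings):
    """Greedy one-to-one matching by descending cosine similarity.

    Returns a dict mapping each matched current-clip index to a
    previous-clip index. Current instances left unmatched are absent
    from the dict and could be treated as new tracks.
    """
    pairs = sorted(
        (
            (cosine(p, c), i, j)
            for i, p in enumerate(prev_embeddings)
            for j, c in enumerate(curr_embeddings)
        ),
        reverse=True,
    )
    used_prev, assignment = set(), {}
    for _, i, j in pairs:
        if i not in used_prev and j not in assignment:
            used_prev.add(i)
            assignment[j] = i
    return assignment


# Toy data (hypothetical): 7 frames, clips of 3 frames, 2 instances per clip.
clips = split_into_clips(7, 3)
prev_clip = [[1.0, 0.0], [0.0, 1.0]]
curr_clip = [[0.1, 0.9], [0.9, 0.2]]
links = match_instances(prev_clip, curr_clip)
```

In this toy setup the instances swap order between clips, and the matching recovers the correspondence. A Hungarian assignment could replace the greedy loop when a globally optimal matching is wanted.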

Abstract

Open-Vocabulary Video Instance Segmentation (VIS) is attracting increasing attention due to its ability to segment and track arbitrary objects. However, recent Open-Vocabulary VIS attempts have obtained unsatisfactory results, especially in their ability to generalize to novel categories. We find two central causes: the domain gap between VLM features (e.g., CLIP) and instance queries, and the underutilization of temporal consistency. To mitigate these issues, we design and train a novel Open-Vocabulary VIS baseline called OVFormer. OVFormer uses a lightweight module for unified embedding alignment between query embeddings and CLIP image embeddings to remedy the domain gap. Unlike previous image-based training methods, we conduct video-based model training and deploy a semi-online inference scheme to fully mine the temporal consistency in the video. Without…
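
The abstract does not spell out how the alignment module works, so the following is only one plausible reading, written as a plain-Python sketch rather than the authors' code. It assumes an instance query attends over CLIP image embeddings (scaled dot-product attention with a softmax) to produce an aligned embedding, which is then scored against CLIP text embeddings by cosine similarity. The function names, the temperature of 0.07, and the toy vectors are assumptions made for illustration.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(dot(v, v)) or 1.0
    return [x / norm for x in v]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def align_query(query, image_embeddings):
    """Hypothetical alignment step: attend from one query over image embeddings.

    Returns the attention-weighted sum of the image embeddings, so the
    result lives in the same space as the image (and text) embeddings.
    """
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, e) / scale for e in image_embeddings])
    dim = len(image_embeddings[0])
    return [
        sum(w * e[i] for w, e in zip(weights, image_embeddings))
        for i in range(dim)
    ]


def classify(aligned, text_embeddings, temperature=0.07):
    """Softmax over temperature-scaled cosine similarities to each class text embedding."""
    a = normalize(aligned)
    logits = [dot(a, normalize(t)) / temperature for t in text_embeddings]
    return softmax(logits)


# Toy vectors standing in for real CLIP features (hypothetical values).
image_embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
text_embeddings = {"cat": [1.0, 0.1, 0.0], "dog": [0.0, 1.0, 0.1]}
query = [4.0, 0.0, 0.0]

aligned = align_query(query, image_embeddings)
probs = classify(aligned, list(text_embeddings.values()))
predicted = max(zip(text_embeddings, probs), key=lambda kv: kv[1])[0]
```

Because the class list is just a set of text embeddings, new category names can be added at inference time without retraining, which is the property that makes the setting open-vocabulary.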

Code & Models

Repositories

fanghaook/ovformer · PyTorch · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Human Pose and Action Recognition

Methods: Softmax · Attention Is All You Need · Contrastive Language-Image Pre-training