Towards Open-Vocabulary Video Semantic Segmentation

Xinhao Li; Yun Liu; Guolei Sun; Min Wu; Le Zhang; Ce Zhu

arXiv:2412.09329 · cs.MM · December 13, 2024

TL;DR

This paper introduces the OV-VSS task for open-vocabulary video semantic segmentation, proposing a baseline model that leverages spatial-temporal fusion, frame enhancement, and text encoding to improve zero-shot generalization to novel categories.

Contribution

The paper defines the new OV-VSS task and presents OV2VSS, a baseline model that combines a spatial-temporal fusion module, a random frame enhancement module, and video text encoding for open-vocabulary video segmentation.

Findings

1. Demonstrates improved zero-shot segmentation on VSPW and Cityscapes datasets.
2. Validates the effectiveness of spatial-temporal fusion and text encoding modules.
3. Shows strong generalization to unseen categories in video segmentation.

Abstract

Semantic segmentation in videos has been a focal point of recent research. However, existing models encounter challenges when faced with unfamiliar categories. To address this, we introduce the Open Vocabulary Video Semantic Segmentation (OV-VSS) task, designed to accurately segment every pixel across a wide range of open-vocabulary categories, including those that are novel or previously unexplored. To enhance OV-VSS performance, we propose a robust baseline, OV2VSS, which integrates a spatial-temporal fusion module, allowing the model to utilize temporal relationships across consecutive frames. Additionally, we incorporate a random frame enhancement module, broadening the model's understanding of semantic context throughout the entire video sequence. Our approach also includes video text encoding, which strengthens the model's capability to interpret textual information within the…
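
To make the open-vocabulary idea concrete, here is a minimal sketch of one common way such models label pixels: each pixel feature is compared with a text embedding per class name, and the most similar class wins. This is an illustration under stated assumptions, not the OV2VSS implementation. The function names, the toy embeddings, and the use of plain averaging across frames (a crude stand-in for the paper's spatial-temporal fusion module) are all hypothetical, and plain Python is used so the sketch runs without dependencies.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def temporal_average(frame_features):
    """Average per-pixel features over frames.

    Assumption: a simple mean stands in for a learned temporal fusion module.
    frame_features: list of T frames, each a list of P pixel feature vectors.
    """
    num_frames = len(frame_features)
    num_pixels = len(frame_features[0])
    dim = len(frame_features[0][0])
    return [
        [sum(frame_features[t][p][d] for t in range(num_frames)) / num_frames
         for d in range(dim)]
        for p in range(num_pixels)
    ]

def classify_pixels(pixel_features, text_embeddings, class_names):
    """Label each pixel with the class whose text embedding has the highest cosine similarity."""
    text_unit = [l2_normalize(t) for t in text_embeddings]
    labels = []
    for feat in pixel_features:
        f = l2_normalize(feat)
        scores = [sum(a * b for a, b in zip(f, t)) for t in text_unit]
        labels.append(class_names[scores.index(max(scores))])
    return labels

# Toy data: 2 frames, 3 pixels, 2-D features. All values are made up.
frames = [
    [[1.0, 0.1], [0.2, 0.9], [0.6, 0.5]],
    [[0.9, 0.0], [0.0, 1.0], [0.8, 0.3]],
]
class_names = ["road", "sky"]               # the vocabulary is supplied at inference time
text_embeddings = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical text-encoder outputs

fused = temporal_average(frames)
labels = classify_pixels(fused, text_embeddings, class_names)
```

Because the class list is only an input to `classify_pixels`, a category unseen during training can be added by appending its name and text embedding, which is the property that zero-shot evaluation on novel categories relies on.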

Code & Models

Repositories

AVC2-UESTC/OV2VSS
PyTorch · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Video Analysis and Summarization