Towards Open-Vocabulary Video Semantic Segmentation
Xinhao Li, Yun Liu, Guolei Sun, Min Wu, Le Zhang, Ce Zhu

TL;DR
This paper introduces the OV-VSS task for open-vocabulary video semantic segmentation, proposing a baseline model that leverages spatial-temporal fusion, frame enhancement, and text encoding to improve zero-shot generalization to novel categories.
Contribution
The paper defines the new OV-VSS task and presents OV2VSS, a baseline model that combines a spatial-temporal fusion module, a random frame enhancement module, and video text encoding for open-vocabulary video segmentation.
Findings
Demonstrates improved zero-shot segmentation on VSPW and Cityscapes datasets.
Validates the effectiveness of spatial-temporal fusion and text encoding modules.
Shows strong generalization to unseen categories in video segmentation.
Abstract
Semantic segmentation in videos has been a focal point of recent research. However, existing models encounter challenges when faced with unfamiliar categories. To address this, we introduce the Open Vocabulary Video Semantic Segmentation (OV-VSS) task, designed to accurately segment every pixel across a wide range of open-vocabulary categories, including those that are novel or previously unexplored. To enhance OV-VSS performance, we propose a robust baseline, OV2VSS, which integrates a spatial-temporal fusion module, allowing the model to utilize temporal relationships across consecutive frames. Additionally, we incorporate a random frame enhancement module, broadening the model's understanding of semantic context throughout the entire video sequence. Our approach also includes video text encoding, which strengthens the model's capability to interpret textual information within the…
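To make the open-vocabulary idea concrete, the following is a minimal sketch of the generic matching step that such models typically rely on: each pixel feature is compared against text embeddings of the class names, and the most similar class wins. It is not the OV2VSS implementation. The function names (`segment_frame`, `fuse_frames`), the two-dimensional toy embeddings, and the use of plain frame averaging as a stand-in for temporal fusion are all assumptions made for illustration.

```python
import math


def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    a, b = normalize(a), normalize(b)
    return sum(x * y for x, y in zip(a, b))


def fuse_frames(frames):
    """Average per-pixel features over frames.

    Assumption: a naive stand-in for temporal fusion, not the paper's module.
    frames: list of H x W grids of D-dimensional feature vectors.
    """
    t = len(frames)
    first = frames[0]
    return [
        [
            [sum(f[i][j][k] for f in frames) / t for k in range(len(first[i][j]))]
            for j in range(len(first[i]))
        ]
        for i in range(len(first))
    ]


def segment_frame(pixel_feats, text_embeds):
    """Label each pixel with the class whose text embedding is most similar.

    pixel_feats: H x W grid of feature vectors (hypothetical visual encoder output).
    text_embeds: dict mapping class name to embedding (hypothetical text encoder output).
    """
    names = list(text_embeds)
    return [
        [max(names, key=lambda n: cosine(feat, text_embeds[n])) for feat in row]
        for row in pixel_feats
    ]


# Toy data: one row of two pixels, two consecutive frames, two classes.
text_embeds = {"road": [1.0, 0.0], "tree": [0.0, 1.0]}
frame_a = [[[0.9, 0.1], [0.2, 0.8]]]
frame_b = [[[0.7, 0.3], [0.4, 0.6]]]

labels = segment_frame(fuse_frames([frame_a, frame_b]), text_embeds)
print(labels)
```

Because the class set is just the keys of `text_embeds`, a novel category can be added at inference time by supplying one more text embedding, without retraining the visual side. That property is what the "open-vocabulary" setting refers to.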
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Video Analysis and Summarization
