Exploiting Spatial-Temporal Semantic Consistency for Video Scene Parsing
Xingjian He, Weining Wang, Zhiyong Xu, Hao Wang, Jie Jiang, Jing Liu

TL;DR
This paper introduces a spatial-temporal semantic consistency approach for video scene parsing that combines a consistency loss with pseudo-labeling, and it placed 1st in the VSPW challenge at ICCV2021.
Contribution
It proposes a Spatial-Temporal Semantic Consistency method that pairs a spatial-temporal consistency loss with a pseudo-labeling strategy to enrich the training data for video scene parsing.
Findings
Achieved 59.84% mIoU on the VSPW development set (test part 1).
Achieved 58.85% mIoU on the VSPW test set.
Won 1st place in the VSPW challenge at ICCV2021.
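The scores above are reported in mIoU (mean intersection-over-union). As a reminder of how this metric is typically computed, here is a minimal sketch in plain Python. It is not the VSPW evaluation code: the function name `mean_iou`, the `ignore_index` value of 255, and the choice to average only over classes that actually occur are assumptions made for illustration.

```python
def mean_iou(pred, target, num_classes, ignore_index=255):
    """Mean IoU over flattened label sequences.

    Assumes every prediction is in [0, num_classes) and that unlabeled
    ground-truth pixels are marked with ignore_index (an assumption).
    """
    intersection = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, target):
        if t == ignore_index:
            continue  # skip unlabeled pixels
        if p == t:
            # A correct pixel counts once toward both intersection and union.
            intersection[t] += 1
            union[t] += 1
        else:
            # A wrong pixel enlarges the union of both classes involved.
            union[p] += 1
            union[t] += 1
    # Average only over classes that appear in the prediction or the target.
    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    return sum(ious) / len(ious) if ious else 0.0


# Toy example with two classes and four pixels.
score = mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2)
```

In the toy example, class 0 has IoU 1/2 and class 1 has IoU 2/3, so `score` is their mean, 7/12.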
Abstract
Compared with image scene parsing, video scene parsing introduces temporal information, which can effectively improve the consistency and accuracy of prediction. In this paper, we propose a Spatial-Temporal Semantic Consistency method to capture class-exclusive context information. Specifically, we design a spatial-temporal consistency loss to constrain semantic consistency in the spatial and temporal dimensions. In addition, we adopt a pseudo-labeling strategy to enrich the training dataset. We obtain scores of 59.84% and 58.85% mIoU on the development (test part 1) and testing sets of VSPW, respectively. Our method won 1st place in the VSPW challenge at ICCV2021.
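The abstract does not say how the pseudo-labels are produced. One common approach, assumed here purely for illustration, is to keep a model's prediction on an unlabeled pixel only when its confidence clears a threshold and to mark the rest as ignored. The function name `make_pseudo_labels`, the threshold of 0.9, and the ignore value of 255 are all hypothetical and are not taken from the paper.

```python
def make_pseudo_labels(probs, threshold=0.9, ignore_index=255):
    """Turn per-pixel class probabilities into pseudo-labels.

    probs is a list with one probability list per pixel. A pixel keeps its
    most likely class only if that probability is at least `threshold`;
    otherwise it is marked with ignore_index so a loss can skip it.
    This is an assumed, generic strategy, not the paper's exact procedure.
    """
    labels = []
    for pixel_probs in probs:
        # Index of the most probable class for this pixel.
        best = max(range(len(pixel_probs)), key=pixel_probs.__getitem__)
        if pixel_probs[best] >= threshold:
            labels.append(best)
        else:
            labels.append(ignore_index)
    return labels


# Three pixels, two classes: confident, uncertain, confident.
pseudo = make_pseudo_labels([[0.95, 0.05], [0.6, 0.4], [0.1, 0.9]])
```

Here the middle pixel falls below the threshold and is marked as ignored, while the other two keep their predicted classes. Frames labeled this way could then be added to the training set alongside the annotated ones.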
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization
