Time2General: Learning Spatiotemporal Invariant Representations for Domain-Generalization Video Semantic Segmentation

Siyu Chen; Ting Han; Haoling Huang; Chaolei Wang; Chengzheng Fu; Duxin Zhu; Guorong Cai; Jinhe Su

arXiv:2602.09648·cs.CV·February 24, 2026

TL;DR

Time2General introduces a novel framework for domain-generalized video semantic segmentation that enhances temporal stability and cross-domain accuracy without requiring target labels or test-time adaptation.

Contribution

It proposes a Spatio-Temporal Memory Decoder and Masked Temporal Consistency Loss to improve temporal stability and robustness in domain-generalized video segmentation.

Findings

01 Significant improvement in cross-domain accuracy.

02 Enhanced temporal stability with reduced flicker.

03 Operates at up to 18 FPS on benchmarks.

Abstract

Domain Generalized Video Semantic Segmentation (DGVSS) trains a model on a single labeled driving domain and deploys it directly on unseen domains, without target labels or test-time adaptation, while maintaining temporally consistent predictions over video streams. In practice, both domain shift and temporal-sampling shift break correspondence-based propagation and fixed-stride temporal aggregation, causing severe frame-to-frame flicker even in label-stable regions. We propose Time2General, a DGVSS framework built on Stability Queries. Time2General introduces a Spatio-Temporal Memory Decoder that aggregates multi-frame context into a clip-level spatio-temporal memory and decodes temporally consistent per-frame masks without explicit correspondence propagation. To further suppress flicker and improve robustness to varying sampling rates, the Masked Temporal Consistency Loss is proposed to…
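To make "frame-to-frame flicker in label-stable regions" concrete, here is a minimal sketch of one way such flicker could be counted. It is an illustration only, not the metric or loss used in the paper: the function name `flicker_rate`, the nested-list frame format, and the choice to compare the same pixel coordinate across frames (no motion compensation) are all assumptions made for this example.

```python
from typing import List, Optional

# Hypothetical frame format for this sketch: an H x W grid of class ids.
Frame = List[List[int]]


def flicker_rate(preds: List[Frame], labels: Optional[List[Frame]] = None) -> float:
    """Fraction of pixels whose predicted class changes between consecutive frames.

    If `labels` is given, only label-stable pixels are counted, i.e. pixels
    whose ground-truth class is identical in both frames of the pair.
    """
    changed = 0
    counted = 0
    for t in range(1, len(preds)):
        prev, curr = preds[t - 1], preds[t]
        for y, row in enumerate(curr):
            for x, cls in enumerate(row):
                if labels is not None and labels[t][y][x] != labels[t - 1][y][x]:
                    continue  # the label itself changed, so a prediction change is expected
                counted += 1
                if cls != prev[y][x]:
                    changed += 1
    return changed / counted if counted else 0.0


# Toy clip: three 2x2 frames with a static scene; one pixel flips class and flips back.
labels = [[[0, 1], [1, 1]]] * 3
preds = [
    [[0, 1], [1, 1]],
    [[0, 1], [0, 1]],  # bottom-left pixel flickers to class 0
    [[0, 1], [1, 1]],
]
print(flicker_rate(preds, labels))  # 2 changed out of 8 counted pixel pairs -> 0.25
```

A lower value means more temporally stable predictions. Restricting the count to label-stable pixels keeps genuine scene changes from being penalized as flicker.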

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications · Human Pose and Action Recognition