StPR: Spatiotemporal Preservation and Routing for Exemplar-Free Video Class-Incremental Learning

Huaijie Wang; De Cheng; Guozhang Li; Zhipeng Xu; Lingfeng He; Jie Li; Nannan Wang; Xinbo Gao

arXiv:2505.13997·cs.CV·October 1, 2025

TL;DR

This paper introduces StPR, a novel exemplar-free framework for video class-incremental learning that preserves spatiotemporal information and dynamically routes task-specific experts, outperforming existing methods.

Contribution

The paper proposes a unified, exemplar-free VCIL approach combining spatiotemporal preservation with dynamic expert routing, addressing limitations of prior static and exemplar-based methods.

Findings

01. Outperforms existing VCIL baselines on UCF101, HMDB51, and Kinetics400.

02. Effectively preserves prior knowledge through semantic channel regularization.

03. Achieves improved interpretability and efficiency in video class-incremental learning.
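
The "semantic channel regularization" in finding 02 can be illustrated with a generic importance-weighted distillation penalty: channels judged more important are penalized more heavily for drifting away from the previous model's features. The sketch below is not the paper's FSSD loss. The function name, the normalization by the sum of weights, and the toy numbers are all assumptions made for illustration.

```python
from typing import Sequence


def weighted_channel_distillation(
    old_feats: Sequence[float],
    new_feats: Sequence[float],
    importance: Sequence[float],
) -> float:
    """Hypothetical penalty: importance-weighted mean squared drift per channel.

    old_feats / new_feats: one scalar feature per channel from the previous
    and current model. importance: a non-negative weight per channel
    (how such weights are derived is outside the scope of this sketch).
    """
    if not (len(old_feats) == len(new_feats) == len(importance)):
        raise ValueError("all inputs must have one entry per channel")
    total_weight = sum(importance)
    if total_weight <= 0:
        raise ValueError("importance weights must sum to a positive value")
    # Squared drift of each channel, scaled by that channel's importance.
    weighted = sum(
        w * (new - old) ** 2
        for old, new, w in zip(old_feats, new_feats, importance)
    )
    return weighted / total_weight


if __name__ == "__main__":
    old = [1.0, 2.0]
    new = [1.0, 4.0]  # only the second channel has drifted
    print(weighted_channel_distillation(old, new, [3.0, 1.0]))
    print(weighted_channel_distillation(old, new, [1.0, 3.0]))
```

In this toy case the same drift costs more when the drifting channel carries the larger weight, which is the intuition behind preserving important channels while leaving the others free to adapt.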

Abstract

Video Class-Incremental Learning (VCIL) seeks to develop models that continuously learn new action categories over time without forgetting previously acquired knowledge. Unlike traditional Class-Incremental Learning (CIL), VCIL introduces the added complexity of spatiotemporal structures, making it particularly challenging to mitigate catastrophic forgetting while effectively capturing both frame-shared semantics and temporal dynamics. Existing approaches either rely on exemplar rehearsal, raising concerns over memory and privacy, or adapt static image-based methods that neglect temporal modeling. To address these limitations, we propose Spatiotemporal Preservation and Routing (StPR), a unified and exemplar-free VCIL framework that explicitly disentangles and preserves spatiotemporal information. First, we introduce Frame-Shared Semantics Distillation (FSSD), which identifies…

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01·Rating 4·Confidence 4

Strengths

1) The paper proposes an exemplar-free setting, which avoids the memory and privacy issues associated with storing historical data, making it more practical for real-world applications. 2) The theoretical analysis is substantial, providing derivations for the channel importance metrics (Fisher Information, classification contribution) used in FSSD. 3) The experimental analysis is thorough, validating the method's effectiveness across multiple datasets and various task partitions, and includes co

Weaknesses

1) Line 70: "reuses these decomposed components to enhance the model’s ability to adapt continually, thereby reducing forgetting without storing extensive exemplars." Is this statement problematic, as the subsequent text does not mention reusing these decomposed components? Furthermore, the claim that decomposed components help enhance adaptability is also debatable. 2) Why are channels with high semantic importance preserved? The semantic importance from a previous session does not necessarily

Reviewer 02·Rating 6·Confidence 4

Strengths

Originality: The disentanglement of frame-shared semantics and temporal dynamics for VCIL is novel. While MoE and distillation exist in CIL, their adaptation to video via temporal decomposition and semantic importance weighting is creative and domain-aware. Quality: Rigorous experiments with strong baselines, multiple datasets, and thorough ablations. The use of CLIP aligns with modern vision-language trends. Clarity: The framework is easy to follow (Figure 2), and algorithms are pr

Weaknesses

Task-Specific Expert Scaling: TD-MoE allocates one spatiotemporal encoder per task. While efficient per expert (~9M params), this leads to linear growth in parameters with the number of tasks (e.g., 10 tasks → ~90M extra params). This may limit scalability in long-task sequences. The paper does not discuss parameter efficiency in the long run or compare total model size vs. baselines like L2P or ST-Prompt. Dependence on CLIP: The method relies heavily on frozen CLIP features. While common, this
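
The linear-growth concern raised under Task-Specific Expert Scaling can be made concrete with simple arithmetic. The sketch below takes the reviewer's approximate figure of ~9M parameters per expert as given and assumes exactly one expert is added per task. The helper name and the task counts other than 10 are hypothetical and chosen only for illustration.

```python
# Illustrative arithmetic only: the ~9M-per-expert figure is the reviewer's estimate.
PARAMS_PER_EXPERT_M = 9.0  # millions of parameters per task-specific expert (approximate)


def extra_params_millions(num_tasks: int, per_expert_m: float = PARAMS_PER_EXPERT_M) -> float:
    """Extra parameters (in millions) when one expert is added per task."""
    if num_tasks < 0:
        raise ValueError("num_tasks must be non-negative")
    return num_tasks * per_expert_m


if __name__ == "__main__":
    # Hypothetical task counts; 10 matches the reviewer's example.
    for num_tasks in (5, 10, 20, 50):
        extra = extra_params_millions(num_tasks)
        print(f"{num_tasks:>3} tasks -> ~{extra:.0f}M extra parameters")
```

Under these assumptions, doubling the number of tasks doubles the added parameter count, which is why the reviewer asks for a comparison of total model size on long task sequences.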

Reviewer 03·Rating 6·Confidence 4

Strengths

1. This paper introduces an effective rehearsal-free method that achieves state-of-the-art performance on benchmarks such as vCLIMB and TCD, outperforming both exemplar-based (e.g., FrameMaker, HCE) and exemplar-free (e.g., STSP, ST-Prompt) methods. 2. Temporal Decomposition-based Mixture-of-Experts (TD-MoE) provides routing based on temporal residuals and attention-weighted deviations, which allows inference without task IDs. 3. They provide a well-designed distillation loss FSSD

Weaknesses

1. The proposed TD-MoE instantiates a separate expert for each new task. Consequently, both memory and computation costs are expected to scale linearly with the number of tasks. The paper does not provide an explicit analysis or discussion of this scalability issue, nor of potential mitigation strategies. 2. It would be valuable to understand the limitations of TD-MoE on temporally challenging datasets like SSv2. For instance, analyzing the forgetting of the model with and without the TD-MoE