Temporal Saliency-Guided Distillation: A Scalable Framework for Distilling Video Datasets

Xulin Gu; Xinhao Zhong; Zhixing Wei; Yimin Zhou; Shuoyang Sun; Bin Chen; Hongpeng Wang; Yuan Luo

arXiv:2505.20694·cs.CV·May 28, 2025

Open Access · 3 Reviews

TL;DR

This paper introduces a scalable video dataset distillation framework that uses temporal saliency-guided filtering to efficiently compress video data while preserving essential temporal dynamics, achieving state-of-the-art results.

Contribution

The paper presents a novel uni-level distillation method with temporal saliency guidance, improving efficiency and performance in video dataset compression.

Findings

01

Achieves state-of-the-art performance on standard benchmarks.

02

Effectively preserves temporal dynamics in distilled videos.

03

Reduces computational costs compared to existing methods.

Abstract

Dataset distillation (DD) has emerged as a powerful paradigm for dataset compression, enabling the synthesis of compact surrogate datasets that approximate the training utility of large-scale ones. While significant progress has been achieved in distilling image datasets, extending DD to the video domain remains challenging due to the high dimensionality and temporal complexity inherent in video data. Existing video distillation (VD) methods often suffer from excessive computational costs and struggle to preserve temporal dynamics, as naïve extensions of image-based approaches typically lead to degraded performance. In this paper, we propose a novel uni-level video dataset distillation framework that directly optimizes synthetic videos with respect to a pre-trained model. To address temporal redundancy and enhance motion preservation, we introduce a temporal saliency-guided filtering…

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 3

Strengths

1. The proposed method effectively prunes redundant frames, which makes the distillation process significantly faster and more tractable for massive datasets, and it achieves new state-of-the-art accuracy on multiple benchmarks.
2. The TSGF is a well-motivated addition that explicitly addresses the challenge of preserving temporal dynamics, a critical weakness in previous video distillation methods.

Weaknesses

1. Dataset distillation maintains performance with significantly smaller synthetic datasets; however, the baseline and its variants appear to underperform (e.g., 22.4% on K400). Since smaller models tend to overfit small-scale datasets, they may not adequately demonstrate the generalization capability of distilled datasets. While Table 4 shows that this method benefits larger models, could the authors further validate this finding using video models of standard scale rather than those with only

Reviewer 02 · Rating 4 · Confidence 4

Strengths

1. Conceptually clean and efficient framework: The uni-level training scheme is straightforward yet effective, avoiding iterative teacher-student feedback while maintaining strong performance.
2. Temporal Saliency Guided Filter (TSGF): The proposed TSGF provides a principled way to incorporate motion cues without requiring optical flow or explicit temporal modeling, which enhances both interpretability and efficiency.
3. Comprehensive empirical validation: The framework achieves consistent

Weaknesses

1. Insufficient positioning relative to decoupled dataset distillation methods. The pipeline (Sec. 3.2, Eq. (4)–(5)) resembles decoupled optimization schemes such as SRe2L, where a frozen teacher guides the synthetic data optimization. The distinction between the proposed uni-level framework and existing decoupled methods is not clearly explained, leaving the degree of conceptual novelty somewhat ambiguous.
2. Incomplete hyperparameter specification for the Temporal Saliency Guided Filter.
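The "frozen teacher guides the synthetic data optimization" pattern mentioned in the first weakness can be illustrated with a toy sketch. This is not the paper's Eq. (4)–(5), which are not reproduced on this page. It assumes a hypothetical linear classifier standing in for the pre-trained model, a flattened random tensor standing in for a synthetic clip, and a plain cross-entropy objective; the names `loss_and_input_grad`, `distill_sample`, and `teacher_w` are illustrative only.

```python
import numpy as np


def loss_and_input_grad(weights, bias, x, label):
    """Cross-entropy of a frozen linear classifier and its gradient w.r.t. the input x."""
    logits = weights @ x + bias
    shifted = logits - logits.max()  # stabilize the log-sum-exp
    log_probs = shifted - np.log(np.exp(shifted).sum())
    probs = np.exp(log_probs)
    one_hot = np.zeros_like(probs)
    one_hot[label] = 1.0
    grad_x = weights.T @ (probs - one_hot)  # d(loss)/dx for softmax cross-entropy
    return -log_probs[label], grad_x


def distill_sample(weights, bias, label, shape, steps=200, lr=0.01, seed=0):
    """Single-level loop: only the synthetic sample is updated, never the teacher."""
    rng = np.random.default_rng(seed)
    x = rng.normal(scale=0.01, size=int(np.prod(shape)))
    losses = []
    for _ in range(steps):
        loss, grad_x = loss_and_input_grad(weights, bias, x, label)
        losses.append(float(loss))
        x -= lr * grad_x  # gradient step on the data, not on the model
    return x.reshape(shape), losses


# Hypothetical setup: a tiny (T, H, W) = (4, 4, 2) "clip" and a 5-class teacher.
rng = np.random.default_rng(42)
clip_shape = (4, 4, 2)
num_classes = 5
teacher_w = rng.normal(size=(num_classes, int(np.prod(clip_shape))))
teacher_b = np.zeros(num_classes)
teacher_w_before = teacher_w.copy()

synthetic_clip, losses = distill_sample(teacher_w, teacher_b, label=2, shape=clip_shape)
predicted = int(np.argmax(teacher_w @ synthetic_clip.ravel() + teacher_b))
print("first loss:", losses[0], "last loss:", losses[-1], "predicted class:", predicted)
```

There is no inner loop that retrains a model on the synthetic data, which is what separates a single-level scheme from bi-level formulations. Whether the paper's objective adds further terms cannot be determined from this page.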

Reviewer 03 · Rating 4 · Confidence 4

Strengths

1. The TSGF mechanism is an innovative and lightweight way to capture temporal importance without relying on heavy optical flow or 3D convolutions.
2. The model design is computationally efficient: it adopts a uni-level optimization design instead of complex bi-level optimization, which substantially reduces memory footprint and training cost and makes it suitable for large-scale video data distillation.
3. The method achieves state-of-the-art results on multiple standard benchmarks (UCF101, HMDB51,

Weaknesses

1. The paper lacks a formal theoretical definition and justification of “temporal saliency,” and the TSGF design appears heuristic, without mathematical modeling or convergence analysis.
2. Using raw inter-frame differences may not capture complex motion semantics and may fail in scenes with camera motion or background clutter.
3. Although the method claims potential applicability to other video tasks, only preliminary results are presented for temporal action segmentation, without validation on
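The concern in the second weakness, that raw inter-frame differences respond to camera motion as well as to object motion, can be illustrated with a toy sketch. It assumes a generic saliency score (mean absolute difference between consecutive frames), not the paper's TSGF, whose exact definition is not given on this page. The clips are synthetic, and `frame_difference_saliency` and `make_clip` are hypothetical helpers.

```python
import numpy as np


def frame_difference_saliency(video):
    """Mean absolute difference between consecutive frames.

    video: array of shape (T, H, W). Returns an array of length T - 1.
    """
    video = np.asarray(video, dtype=np.float64)
    return np.abs(video[1:] - video[:-1]).mean(axis=(1, 2))


def make_clip(num_frames=6, size=16, object_moves=False, camera_pans=False):
    """Toy clip: textured background with an optional moving patch or horizontal pan."""
    rng = np.random.default_rng(0)
    # The background is wider than the frame so a pan can slide across it.
    background = rng.random((size, size + num_frames))
    frames = []
    for t in range(num_frames):
        offset = t if camera_pans else 0
        frame = background[:, offset:offset + size].copy()
        if object_moves:
            # A small bright patch that shifts one column per frame.
            frame[6:10, 2 + t:4 + t] = 1.0
        frames.append(frame)
    return np.stack(frames)


static_scores = frame_difference_saliency(make_clip())
object_scores = frame_difference_saliency(make_clip(object_moves=True))
pan_scores = frame_difference_saliency(make_clip(camera_pans=True))

print("static scene:  ", static_scores.mean())
print("moving object: ", object_scores.mean())
print("camera pan:    ", pan_scores.mean())
```

In this toy setting the pan changes every pixel of a textured background, so it scores higher than the clip containing the only real object motion. This is the failure mode the reviewer describes, shown under these assumptions only.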

Taxonomy

Topics: Visual Attention and Saliency Detection · Advanced Image and Video Retrieval Techniques