DejaVid: Encoder-Agnostic Learned Temporal Matching for Video Classification

Darryl Ho; Samuel Madden

arXiv:2506.12585 · cs.CV · June 17, 2025

TL;DR

DejaVid is a flexible, encoder-agnostic method that enhances video classification by learning temporal feature importance without retraining large models, significantly improving accuracy with minimal additional parameters.

Contribution

We introduce DejaVid, a novel approach that converts video embeddings into variable-length sequences and learns temporal feature weights, improving large encoder performance without architectural changes or extensive retraining.

Findings

01 Achieves 77.2% top-1 accuracy on Something-Something V2

02 Improves Kinetics-400 accuracy to 89.1%

03 Adds less than 1.8% additional parameters and requires under 3 hours of training

Abstract

In recent years, large transformer-based video encoder models have greatly advanced state-of-the-art performance on video classification tasks. However, these large models typically process videos by averaging embedding outputs from multiple clips over time to produce fixed-length representations. This approach fails to account for a variety of time-related features, such as variable video durations, chronological order of events, and temporal variance in feature significance. While methods for temporal modeling do exist, they often require significant architectural changes and expensive retraining, making them impractical for off-the-shelf, fine-tuned large encoders. To overcome these limitations, we propose DejaVid, an encoder-agnostic method that enhances model performance without the need for retraining or altering the architecture. Our framework converts a video into a…
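
To make concrete the abstract's point that averaging loses the chronological order of events, here is a minimal Python sketch. It is not the DejaVid architecture: the toy embeddings, the function names `mean_pool` and `weighted_sequence_distance`, and the uniform weight table are all hypothetical, and the distance assumes two sequences of equal length, which real videos of varying duration would not satisfy.

```python
# Toy illustration only: names, data, and the distance itself are assumptions,
# not taken from the paper.

def mean_pool(clips):
    """Average clip embeddings over time into one fixed-length vector."""
    n = len(clips)
    dim = len(clips[0])
    return [sum(clip[d] for clip in clips) / n for d in range(dim)]


def weighted_sequence_distance(seq_a, seq_b, weights):
    """Weighted L1 distance between two equal-length embedding sequences.

    weights[t][d] scales the contribution of feature d at timestep t,
    standing in for a learned per-timestep, per-feature importance.
    """
    return sum(
        weights[t][d] * abs(seq_a[t][d] - seq_b[t][d])
        for t in range(len(seq_a))
        for d in range(len(seq_a[0]))
    )


# Hypothetical clip embeddings: the same three clips in opposite order.
forward = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
backward = list(reversed(forward))

# Averaging over time cannot tell the two orderings apart.
same_after_pooling = mean_pool(forward) == mean_pool(backward)

# Keeping the sequence preserves the ordering difference.
uniform = [[1.0, 1.0] for _ in forward]  # placeholder for learned weights
sequence_gap = weighted_sequence_distance(forward, backward, uniform)

print(same_after_pooling)  # True
print(sequence_gap)        # 2.0
```

In this sketch the pooled vectors are identical even though the clips occur in reverse order, while the sequence-level distance is nonzero. Replacing the uniform weights with learned values is one simple way to express that some timesteps and features matter more than others.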

Taxonomy

Topics: Human Pose and Action Recognition · Video Analysis and Summarization · Generative Adversarial Networks and Image Synthesis