DejaVid: Encoder-Agnostic Learned Temporal Matching for Video Classification
Darryl Ho, Samuel Madden

TL;DR
DejaVid is a flexible, encoder-agnostic method that enhances video classification by learning temporal feature importance without retraining large models, significantly improving accuracy with minimal additional parameters.
Contribution
We introduce DejaVid, a novel approach that converts video embeddings into variable-length sequences and learns temporal feature weights, improving large encoder performance without architectural changes or extensive retraining.
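To make the contrast with averaged embeddings concrete, the following is a minimal sketch and not the authors' implementation. It assumes each video is already a time-ordered list of per-clip embedding vectors. The helper names (`mean_pool`, `weighted_sequence_distance`) and the per-timestep, per-feature `weights` table are hypothetical, chosen only for illustration. The sketch also assumes both sequences have the same length; comparing sequences of different lengths would need an alignment step that is not shown here.

```python
def mean_pool(sequence):
    """Average a list of T clip embeddings (each a list of D floats) into one D-vector."""
    return [sum(column) / len(sequence) for column in zip(*sequence)]


def weighted_sequence_distance(seq_a, seq_b, weights):
    """Illustrative weighted L1 distance between two equal-length sequences.

    `weights` is a hypothetical T x D table holding one importance value per
    timestep and feature. This is an assumption for the sketch, not the
    paper's actual formulation.
    """
    total = 0.0
    for row_a, row_b, row_w in zip(seq_a, seq_b, weights):
        for a, b, w in zip(row_a, row_b, row_w):
            total += w * abs(a - b)
    return total


# Two toy "videos" containing the same clips in opposite chronological order.
forward = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
backward = list(reversed(forward))

# After averaging, the two videos become indistinguishable.
pooled_gap = sum(abs(a - b) for a, b in zip(mean_pool(forward), mean_pool(backward)))

# Kept as sequences, they differ even with uniform weights.
uniform = [[1.0, 1.0] for _ in forward]
sequence_gap = weighted_sequence_distance(forward, backward, uniform)

print(pooled_gap, sequence_gap)
```

In this toy case the pooled representations are identical, while the sequence comparison still separates the two orderings. Replacing `uniform` with non-uniform values would let some timesteps or features count more than others, which is the general idea of temporal feature weighting.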
Findings
Achieves 77.2% top-1 accuracy on Something-Something V2
Reaches 89.1% accuracy on Kinetics-400
Adds fewer than 1.8% additional parameters and requires under 3 hours of training
Abstract
In recent years, large transformer-based video encoder models have greatly advanced state-of-the-art performance on video classification tasks. However, these large models typically process videos by averaging embedding outputs from multiple clips over time to produce fixed-length representations. This approach fails to account for a variety of time-related features, such as variable video durations, chronological order of events, and temporal variance in feature significance. While methods for temporal modeling do exist, they often require significant architectural changes and expensive retraining, making them impractical for off-the-shelf, fine-tuned large encoders. To overcome these limitations, we propose DejaVid, an encoder-agnostic method that enhances model performance without the need for retraining or altering the architecture. Our framework converts a video into a…
Taxonomy
Topics: Human Pose and Action Recognition · Video Analysis and Summarization · Generative Adversarial Networks and Image Synthesis
