Déjà Vu: Efficient Video-Language Query Engine with Learning-based Inter-Frame Computation Reuse

Jinwoo Hwang; Daeun Kim; Sangyeop Lee; Yoonsung Kim; Guseul Heo; Hojoon Kim; Yunseok Jeong; Tadiwos Meaza; Eunhyeok Park; Jeongseob Ahn; Jongse Park

arXiv:2506.14107·cs.DC·September 11, 2025

TL;DR

Déjà Vu is a system that speeds up video-language models by reusing computations across frames, making large-scale video querying more practical without sacrificing much accuracy.

Contribution

The paper introduces ReuseViT, a modified Vision Transformer that learns to identify inter-frame reuse opportunities, and pairs it with memory-compute techniques that turn that reuse into real performance gains.

Findings

01 Up to 2.64x acceleration in embedding generation

02 Accuracy stays within 2% of baseline models

03 Effective for large-scale video analytics

Abstract

Recently, Video-Language Models (VideoLMs) have demonstrated remarkable capabilities, offering significant potential for flexible and powerful video query systems. These models typically rely on Vision Transformers (ViTs), which process video frames individually to extract visual embeddings. However, generating embeddings for large-scale videos requires ViT inferencing across numerous frames, posing a major hurdle to real-world deployment and necessitating solutions for integration into scalable video data management systems. This paper introduces Déjà Vu, a video-language query engine that accelerates ViT-based VideoLMs by reusing computations across consecutive frames. At its core is ReuseViT, a modified ViT model specifically designed for VideoLM tasks, which learns to detect inter-frame reuse opportunities, striking an effective balance between accuracy and reuse. Although…
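To make the idea of inter-frame computation reuse concrete, here is a minimal sketch in plain Python. It is not ReuseViT: the paper's model learns when to reuse, whereas this sketch swaps in a fixed cosine-similarity threshold, and it treats each patch token independently, which a real ViT with attention does not. The names `encode_with_reuse`, `encode_token`, and `threshold` are hypothetical and chosen only for illustration.

```python
import math
from typing import Callable, List, Tuple

Vector = List[float]
Frame = List[Vector]  # one vector per patch token; all frames have the same length


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def encode_with_reuse(
    frames: List[Frame],
    encode_token: Callable[[Vector], Vector],
    threshold: float = 0.95,  # assumed fixed cutoff; the paper learns this decision instead
) -> Tuple[List[Frame], int, int]:
    """Encode frames in order, reusing a cached output when a token barely changed."""
    outputs: List[Frame] = []
    ref_tokens: Frame = []  # input token that produced each cached output
    cached: Frame = []      # most recently computed output per token position
    computed = 0
    reused = 0

    for frame_index, tokens in enumerate(frames):
        if frame_index == 0:
            # The first frame has nothing to reuse, so every token is computed.
            cached = [encode_token(tok) for tok in tokens]
            ref_tokens = [list(tok) for tok in tokens]
            computed += len(tokens)
        else:
            for i, tok in enumerate(tokens):
                if cosine_similarity(tok, ref_tokens[i]) >= threshold:
                    reused += 1  # keep cached[i] as this token's output
                else:
                    cached[i] = encode_token(tok)
                    ref_tokens[i] = list(tok)
                    computed += 1
        outputs.append(list(cached))  # snapshot so later updates do not alter this frame

    return outputs, computed, reused


# Toy usage: two frames of two tokens each, with a stand-in "encoder" that doubles values.
frames = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 0.01], [1.0, 1.0]],  # first token nearly unchanged, second token changed
]
outputs, computed, reused = encode_with_reuse(frames, lambda t: [2 * x for x in t])
print(computed, reused)  # 3 1
```

In the toy run, the first token of the second frame is served from the cache while the second token is recomputed, so three of the four token encodings are actually executed. Comparing against the token that produced the cached output, rather than against the previous frame, keeps small changes from accumulating unnoticed across many frames.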

Code & Models

Repositories

casys-kaist/dejavu (PyTorch, official)

Taxonomy

Topics: Video Analysis and Summarization · Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques