DeCafNet: Delegate and Conquer for Efficient Temporal Grounding in Long Videos

Zijia Lu; A S M Iftekhar; Gaurav Mittal; Tianjian Meng; Xiawei Wang; Cheng Zhao; Rohith Kukkala; Ehsan Elhamifar; Mei Chen

arXiv:2505.16376·cs.CV·May 23, 2025

TL;DR

DeCafNet introduces a delegate-and-conquer approach to long video temporal grounding that reduces computation by up to 47% while maintaining or improving accuracy, using efficient feature extraction and multi-scale refinement.

Contribution

The paper proposes DeCafNet, which pairs a resource-efficient sidekick encoder with a saliency map that selects the most relevant clips for the full expert encoder, improving on existing methods in both efficiency and accuracy.

Findings

01 Reduces computation by up to 47%
02 Outperforms existing methods on benchmark datasets
03 Establishes new state-of-the-art in efficiency and performance

Abstract

Long Video Temporal Grounding (LVTG) aims to identify specific moments within lengthy videos based on user-provided text queries for effective content retrieval. Existing methods divide a video into clips and process each clip with a full-scale expert encoder, which is challenging to scale because of the prohibitive computational cost of processing the large number of clips in long videos. To address this issue, we introduce DeCafNet, an approach that employs a "delegate-and-conquer" strategy to achieve computational efficiency without sacrificing grounding performance. DeCafNet introduces a sidekick encoder that performs dense feature extraction over all video clips in a resource-efficient manner, while generating a saliency map to identify the most relevant clips for full processing by the expert encoder. To effectively leverage features from sidekick and expert encoders that…
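The delegation step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the saliency of a clip is the cosine similarity between its sidekick feature and a text query feature, and that a fixed fraction of clips is forwarded to the expert encoder. The names `cosine`, `saliency_map`, `select_salient_clips`, and the `keep_ratio` parameter are hypothetical and chosen only for illustration.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def saliency_map(clip_feats: Sequence[Sequence[float]],
                 query_feat: Sequence[float]) -> List[float]:
    """One saliency score per clip.

    Assumption: saliency is query-to-clip cosine similarity computed on
    cheap (sidekick) features. The paper may define it differently.
    """
    return [cosine(feat, query_feat) for feat in clip_feats]


def select_salient_clips(saliency: Sequence[float], keep_ratio: float) -> List[int]:
    """Indices of the highest-scoring clips, returned in temporal order.

    Assumption: a fixed fraction `keep_ratio` of clips is delegated to the
    expensive expert encoder; the remaining clips keep only sidekick features.
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    k = max(1, math.ceil(keep_ratio * len(saliency)))
    ranked = sorted(range(len(saliency)), key=lambda i: saliency[i], reverse=True)
    return sorted(ranked[:k])


# Toy example: four clips with 2-D sidekick features and one query feature.
clip_feats = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
query_feat = [1.0, 0.0]

scores = saliency_map(clip_feats, query_feat)
selected = select_salient_clips(scores, keep_ratio=0.5)
print(selected)  # [0, 2]
```

In this toy setup only clips 0 and 2 would be processed by the expert encoder, so the expensive encoder runs on half of the clips while every clip still has a sidekick feature.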

Code & Models

Repositories

zijialewislu/cvpr2025-decafnet
PyTorch · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Human Pose and Action Recognition

Methods: Contrastive Language-Image Pre-training