VideoLights: Feature Refinement and Cross-Task Alignment Transformer for Joint Video Highlight Detection and Moment Retrieval

Dhiman Paul; Md Rizwan Parvez; Nabeel Mohammed; Shafin Rahman

arXiv:2412.01558 · cs.CV · November 25, 2025

TL;DR

VideoLights is a transformer-based framework for joint video highlight detection and moment retrieval. It improves cross-modal alignment and feature refinement, leverages large vision-language models (LVLMs), and reports state-of-the-art results.

Contribution

The paper introduces a framework that combines feature refinement, cross-modal fusion, a joint-task feedback mechanism, and LVLM integration to address key limitations of prior models for highlight detection and moment retrieval (HD/MR).

Findings

01 Surpasses existing baselines on QVHighlights, TVSum, and Charades-STA

02 Achieves new state-of-the-art performance in joint HD/MR tasks

03 Demonstrates effective use of LVLMs for multimodal feature integration

Abstract

Prevailing joint prediction transformers for Video Highlight Detection and Moment Retrieval (HD/MR) exhibit deficiencies in handling cross-task dynamics, achieving robust video-text alignment, and utilizing effective attention mechanisms, with the potential of Large Language/Vision-Language Models (LLMs/LVLMs) being largely untapped. This paper introduces VideoLights, a novel HD/MR framework addressing these limitations by incorporating: (i) Convolutional Projection and Feature Refinement modules with an alignment loss for enhanced video-text feature congruity; (ii) a Bi-Directional Cross-Modal Fusion network for strongly coupled query-aware representations; (iii) a Uni-directional joint-task feedback mechanism for synergistic task improvement; (iv) hard positive/negative losses for adaptive learning; and (v) the leveraging of LVLMs (e.g., BLIP-2) for superior multimodal feature…
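To make item (ii) of the abstract more concrete, the following is a minimal sketch of bi-directional cross-modal attention written in plain Python. It is not the authors' implementation (the official repository is in PyTorch), and it makes several simplifying assumptions: there are no learned projection matrices, a single attention head is used, and the function names cross_attend and bidirectional_fusion as well as the toy feature vectors are hypothetical. It only illustrates the general idea that video clips can attend to query tokens while query tokens attend to video clips.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cross_attend(queries, keys_values):
    """Scaled dot-product attention without learned projections (assumption).

    Each row of `queries` is replaced by a softmax-weighted average of the
    rows of `keys_values`, which serve as both keys and values here.
    """
    dim = len(keys_values[0])
    scale = math.sqrt(dim)
    attended = []
    for q in queries:
        weights = softmax([dot(q, k) / scale for k in keys_values])
        attended.append(
            [sum(w * k[d] for w, k in zip(weights, keys_values)) for d in range(dim)]
        )
    return attended


def bidirectional_fusion(video, text):
    """Hypothetical helper: run cross-attention in both directions."""
    text_aware_video = cross_attend(video, text)  # one row per video clip
    video_aware_text = cross_attend(text, video)  # one row per query token
    return text_aware_video, video_aware_text


# Toy inputs (made up for illustration): 3 video clips, 2 query tokens, dim 2.
video = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text = [[1.0, 0.0], [0.0, 1.0]]

text_aware_video, video_aware_text = bidirectional_fusion(video, text)
```

In this toy setup each fused clip vector is a convex combination of the two token vectors, so a clip that resembles the first token places more weight on it. A real model would add learned query, key, and value projections, multiple heads, and residual connections.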

Code & Models

Repositories

dpaul06/VideoLights (PyTorch, official)

Models

dpaul06/VideoLights (Hugging Face model)

Datasets

dpaul06/VideoLights (dataset, 59 downloads)

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Video Surveillance and Tracking Methods · Video Analysis and Summarization

Methods: Softmax · Attention Is All You Need · Contrastive Language-Image Pre-training