VideoLights: Feature Refinement and Cross-Task Alignment Transformer for Joint Video Highlight Detection and Moment Retrieval
Dhiman Paul, Md Rizwan Parvez, Nabeel Mohammed, Shafin Rahman

TL;DR
VideoLights is a transformer-based framework for joint video highlight detection and moment retrieval. It improves cross-modal alignment and feature refinement and leverages large vision-language models, achieving state-of-the-art results.
Contribution
It introduces a comprehensive framework with feature refinement, cross-modal fusion, feedback mechanisms, and LVLM integration, addressing key limitations of prior models in HD/MR tasks.
Findings
Surpasses existing baselines on QVHighlights, TVSum, and Charades-STA
Achieves new state-of-the-art performance in joint HD/MR tasks
Demonstrates effective use of LVLMs for multimodal feature integration
Abstract
Prevailing joint prediction transformers for Video Highlight Detection and Moment Retrieval (HD/MR) exhibit deficiencies in handling cross-task dynamics, achieving robust video-text alignment, and utilizing effective attention mechanisms, with the potential of Large Language/Vision-Language Models (LLMs/LVLMs) being largely untapped. This paper introduces VideoLights, a novel HD/MR framework addressing these limitations by incorporating: (i) Convolutional Projection and Feature Refinement modules with an alignment loss for enhanced video-text feature congruity; (ii) a Bi-Directional Cross-Modal Fusion network for strongly coupled query-aware representations; (iii) a Uni-directional joint-task feedback mechanism for synergistic task improvement; (iv) hard positive/negative losses for adaptive learning; and (v) the leveraging of LVLMs (e.g., BLIP-2) for superior multimodal feature…
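To make item (ii) of the abstract more concrete, the following is a minimal sketch of the general idea behind bi-directional cross-modal attention: video clips attend over query words, and query words attend over video clips. It is not the VideoLights implementation. The function names (`cross_attend`, `bidirectional_fusion`), the residual sum, and the absence of learned projection weights are all assumptions made for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, memory):
    """Scaled dot-product attention without learned projections (assumption).

    Each query vector is replaced by a weighted average of the memory
    vectors, weighted by softmax-normalized dot-product similarity.
    """
    dim = len(queries[0])
    attended, weights = [], []
    for q in queries:
        scores = [
            sum(a * b for a, b in zip(q, m)) / math.sqrt(dim) for m in memory
        ]
        w = softmax(scores)
        weights.append(w)
        attended.append(
            [sum(w[j] * memory[j][k] for j in range(len(memory))) for k in range(dim)]
        )
    return attended, weights


def bidirectional_fusion(video, text):
    """Hypothetical two-way fusion: both modalities attend over each other."""
    text_ctx, v2t = cross_attend(video, text)   # clips attend over words
    video_ctx, t2v = cross_attend(text, video)  # words attend over clips
    # Residual sum keeps the original features alongside the attended context.
    fused_video = [[a + b for a, b in zip(v, c)] for v, c in zip(video, text_ctx)]
    fused_text = [[a + b for a, b in zip(t, c)] for t, c in zip(text, video_ctx)]
    return fused_video, fused_text, v2t, t2v


# Toy features: 3 video clips and 2 query tokens, each 2-dimensional.
video = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text = [[1.0, 0.0], [0.5, 0.5]]
fused_video, fused_text, v2t, t2v = bidirectional_fusion(video, text)
```

In this toy setup, `v2t` holds one attention distribution over the query tokens per clip, and `t2v` holds one distribution over the clips per token. A real model would add learned query, key, and value projections, multiple heads, and normalization layers.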
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Video Surveillance and Tracking Methods · Video Analysis and Summarization
Methods: Softmax · Attention Is All You Need · Contrastive Language-Image Pre-training
