GroundNLQ @ Ego4D Natural Language Queries Challenge 2023
Zhijian Hou, Lei Ji, Difei Gao, Wanjun Zhong, Kun Yan, Chao Li, Wing-Kwong Chan, Chong-Wah Ngo, Nan Duan, Mike Zheng Shou

TL;DR
This paper presents GroundNLQ, a novel multi-modal grounding model for egocentric videos, achieving state-of-the-art results in the Ego4D NLQ Challenge 2023 through a two-stage pre-training and fine-tuning strategy.
Contribution
Introduction of GroundNLQ, a multi-scale multi-modal grounding model with a two-stage training approach for egocentric video-language understanding.
Findings
GroundNLQ surpasses all other teams on the blind test set of the Ego4D NLQ Challenge 2023, achieving 25.67 for R1@IoU=0.3 and 18.18 for R1@IoU=0.5.
The two-stage pre-training and fine-tuning strategy improves grounding accuracy.
GroundNLQ effectively handles long videos with multi-scale temporal modeling.
Abstract
In this report, we present our champion solution for Ego4D Natural Language Queries (NLQ) Challenge in CVPR 2023. Essentially, to accurately ground in a video, an effective egocentric feature extractor and a powerful grounding model are required. Motivated by this, we leverage a two-stage pre-training strategy to train egocentric feature extractors and the grounding model on video narrations, and further fine-tune the model on annotated data. In addition, we introduce a novel grounding model GroundNLQ, which employs a multi-modal multi-scale grounding module for effective video and text fusion and various temporal intervals, especially for long videos. On the blind test set, GroundNLQ achieves 25.67 and 18.18 for R1@IoU=0.3 and R1@IoU=0.5, respectively, and surpasses all other teams by a noticeable margin. Our code will be released at https://github.com/houzhijian/GroundNLQ.
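For readers unfamiliar with the reported metric, the following is a minimal sketch of how R1@IoU=t is commonly computed for temporal grounding: the top-ranked predicted time span for each query counts as a hit when its temporal intersection-over-union with the annotated span is at least t. The names `temporal_iou` and `recall_at_1` and the example spans are hypothetical illustrations, not taken from the GroundNLQ code or the official challenge evaluation script.

```python
from typing import List, Tuple

# A time span in seconds: (start, end). Hypothetical type alias for this sketch.
Span = Tuple[float, float]


def temporal_iou(pred: Span, gt: Span) -> float:
    """Intersection-over-union of two 1-D time intervals."""
    # Overlap is zero when the intervals are disjoint.
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_1(top1_preds: List[Span], gts: List[Span], iou_threshold: float) -> float:
    """Percentage of queries whose top-1 prediction reaches the IoU threshold."""
    if len(top1_preds) != len(gts):
        raise ValueError("one top-1 prediction is required per ground-truth span")
    if not gts:
        return 0.0
    hits = sum(temporal_iou(p, g) >= iou_threshold for p, g in zip(top1_preds, gts))
    return 100.0 * hits / len(gts)


# Made-up spans for two queries (assumption: not real Ego4D data).
preds = [(0.0, 10.0), (0.0, 10.0)]
gts = [(5.0, 15.0), (0.0, 10.0)]
r1_at_03 = recall_at_1(preds, gts, 0.3)  # both IoUs (1/3 and 1.0) pass -> 100.0
r1_at_05 = recall_at_1(preds, gts, 0.5)  # only the exact match passes -> 50.0
```

Under this reading, the stricter 0.5 threshold can only lower the score relative to 0.3, which is consistent with the two figures reported in the abstract.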
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling
