GazeNLQ @ Ego4D Natural Language Queries Challenge 2025

Wei-Cheng Lin; Chih-Ming Lien; Chen Lo; Chia-Hung Yeh

arXiv:2506.05782 · cs.CV · June 9, 2025

TL;DR

This paper introduces GazeNLQ, a method that estimates gaze directly from egocentric video using contrastive pretraining and uses the estimated gaze to augment video representations, improving the localization of video segments that match natural language queries.

Contribution

We propose GazeNLQ, a new approach that leverages gaze estimation and contrastive pretraining to enhance video segment retrieval from egocentric videos based on natural language queries.

Findings

01. Achieves R1@IoU0.3 of 27.82
02. Achieves R1@IoU0.5 of 18.68
03. Demonstrates improved localization accuracy using gaze data

Abstract

This report presents our solution to the Ego4D Natural Language Queries (NLQ) Challenge at CVPR 2025. Egocentric video captures the scene from the wearer's perspective, where gaze serves as a key non-verbal communication cue that reflects visual attention and offers insights into human intention and cognition. Motivated by this, we propose a novel approach, GazeNLQ, which leverages gaze to retrieve video segments that match given natural language queries. Specifically, we introduce a contrastive learning-based pretraining strategy for gaze estimation directly from video. The estimated gaze is used to augment video representations within the proposed model, thereby enhancing localization accuracy. Experimental results show that GazeNLQ achieves R1@IoU0.3 and R1@IoU0.5 scores of 27.82 and 18.68, respectively. Our code is available at https://github.com/stevenlin510/GazeNLQ.
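
The two scores reported above are recall-at-1 values at temporal IoU thresholds of 0.3 and 0.5. The sketch below shows how such a score is commonly computed, assuming one top-ranked predicted segment and one ground-truth segment per query, both given as (start, end) times in seconds. The function names `temporal_iou` and `recall_at_1` and the sample segments are illustrative assumptions, not taken from the GazeNLQ code or the official challenge evaluation script.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start_sec, end_sec); hypothetical representation


def temporal_iou(pred: Segment, gt: Segment) -> float:
    """Intersection over union of two time intervals."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_1(top1_preds: List[Segment], gts: List[Segment],
                iou_threshold: float) -> float:
    """Percentage of queries whose top-1 prediction reaches the IoU threshold."""
    if not gts or len(top1_preds) != len(gts):
        raise ValueError("need one top-1 prediction per ground-truth segment")
    hits = sum(temporal_iou(p, g) >= iou_threshold
               for p, g in zip(top1_preds, gts))
    return 100.0 * hits / len(gts)


# Made-up segments for three queries (illustration only, not challenge data).
preds = [(2.0, 7.0), (10.0, 20.0), (0.0, 1.0)]
gts = [(4.0, 10.0), (12.0, 18.0), (5.0, 6.0)]

print(f"R1@IoU0.3 = {recall_at_1(preds, gts, 0.3):.2f}")
print(f"R1@IoU0.5 = {recall_at_1(preds, gts, 0.5):.2f}")
```

In this toy example the three IoU values are 0.375, 0.6, and 0, so two of three queries count as hits at the 0.3 threshold and one of three at 0.5. The stricter threshold can never yield a higher score, which matches the ordering of the reported 27.82 and 18.68.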

Taxonomy

Topics: Gaze Tracking and Assistive Technology · Multimodal Machine Learning Applications · Visual Attention and Saliency Detection