Single-Stage Visual Query Localization in Egocentric Videos

Hanwen Jiang; Santhosh Kumar Ramakrishnan; Kristen Grauman

arXiv:2306.09324 · cs.CV · June 16, 2023 · 6 cites

TL;DR

VQLoC introduces a fast, end-to-end single-stage framework for visual query localization in egocentric videos, significantly improving accuracy and inference speed over prior multi-stage methods.

Contribution

The paper presents VQLoC, a novel single-stage, end-to-end trainable framework that jointly models query-video relationships for efficient spatio-temporal localization.

Findings

01. Outperforms prior VQL methods by 20% in accuracy

02. Achieves 10x faster inference

03. Top entry on the Ego4D VQ2D challenge leaderboard

Abstract

Visual Query Localization (VQL) on long-form egocentric videos requires spatio-temporal search and localization of visually specified objects and is vital to build episodic memory systems. Prior work develops complex multi-stage pipelines that leverage well-established object detection and tracking methods to perform VQL. However, each stage is independently trained and the complexity of the pipeline results in slow inference speeds. We propose VQLoC, a novel single-stage VQL framework that is end-to-end trainable. Our key idea is to first build a holistic understanding of the query-video relationship and then perform spatio-temporal localization in a single-shot manner. Specifically, we establish the query-video relationship by jointly considering query-to-frame correspondences between the query and each video frame and frame-to-frame correspondences between nearby video frames. Our…
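The two kinds of correspondence named in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: it assumes plain dot-product attention, pre-extracted feature arrays, and mean-pooled frame tokens for the temporal step, and the function names (`query_to_frame`, `frame_to_frame`, `local_window_mask`) and the `window` parameter are hypothetical choices made only for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax; entries equal to -inf get weight 0."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def query_to_frame(frame_feats, query_feats):
    """Illustrative query-to-frame step (assumed cross-attention).

    frame_feats: (T, N, D) array, N tokens per frame for T frames.
    query_feats: (M, D) array, M tokens of the visual query.
    Returns a (T, N, D) array in which every frame token is a weighted
    mix of query tokens.
    """
    d = frame_feats.shape[-1]
    scores = frame_feats @ query_feats.T / np.sqrt(d)  # (T, N, M)
    attn = softmax(scores, axis=-1)
    return attn @ query_feats  # (T, N, D)


def local_window_mask(num_frames, window):
    """Boolean (T, T) mask: True where two frames are within `window` steps."""
    idx = np.arange(num_frames)
    return np.abs(idx[:, None] - idx[None, :]) <= window


def frame_to_frame(frame_feats, window):
    """Illustrative frame-to-frame step restricted to nearby frames.

    Frame tokens are mean-pooled to one vector per frame (a simplification
    assumed here), then each frame attends only to frames within `window`.
    Returns the propagated (T, D) features and the (T, T) attention map.
    """
    num_frames, _, d = frame_feats.shape
    pooled = frame_feats.mean(axis=1)  # (T, D)
    scores = pooled @ pooled.T / np.sqrt(d)  # (T, T)
    scores = np.where(local_window_mask(num_frames, window), scores, -np.inf)
    attn = softmax(scores, axis=-1)
    return attn @ pooled, attn
```

In this sketch, a clip with features of shape `(T, N, D)` and a query of shape `(M, D)` would first pass through `query_to_frame`, and the result through `frame_to_frame` with a small `window`, so that each frame's representation depends on the query and on its temporal neighbours only. A per-frame prediction head, which the sketch omits, would then be needed to produce the actual localization.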

Videos

Single-Stage Visual Query Localization in Egocentric Videos · SlidesLive

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning · Multimodal Machine Learning Applications