TL;DR
This paper introduces 3DVQL, the first benchmark for 3D multimodal visual query localization, along with baseline models and a novel fusion algorithm, to advance research in 3D spatial understanding.
Contribution
The paper presents the 3DVQL benchmark, which provides manually annotated multimodal data, together with a new lift-and-attention fusion method, addressing the gap in 3D visual query localization research.
Findings
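Performance of existing methods varies significantly depending on the fusion module used.
The proposed lift-and-attention fusion (LaF) algorithm outperforms the baseline models.
The 3DVQL benchmark comprises 2,002 sequences (around 170,000 frames), each provided as point clouds, RGB images, and depth images.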
Abstract
Visual query localization (VQL) aims to predict the spatio-temporal response of the most recent occurrence in a sequence given a query. Currently, most research focuses on visual query localization in 2D videos, while its counterpart in 3D space has received little attention. In this paper, we make the first attempt to address visual query localization in the 3D world by introducing a novel benchmark, dubbed 3DVQL. Specifically, 3DVQL contains 2,002 sequences with around 170,000 frames and 6.4K response track segments from 38 object categories. Each sequence in 3DVQL is provided with multiple modalities, including point clouds, RGB images, and depth images, to support flexible research. To ensure high-quality annotations, each sequence is manually annotated with multiple rounds of verification and refinement. To the best of our knowledge, 3DVQL is the first benchmark for 3D multimodal…
