Unifying 3D Vision-Language Understanding via Promptable Queries

Ziyu Zhu; Zhuofan Zhang; Xiaojian Ma; Xuesong Niu; Yixin Chen; Baoxiong Jia; Zhidong Deng; Siyuan Huang; Qing Li

arXiv:2405.11442 · cs.CV · July 25, 2024

Open Access

TL;DR

This paper introduces PQ3D, a unified 3D vision-language model that leverages promptable queries to handle diverse 3D scene representations and tasks, achieving state-of-the-art results across multiple benchmarks.

Contribution

The paper proposes a novel unified framework with promptable queries, unifying various 3D scene representations and enabling multi-task learning for 3D vision-language understanding.

Findings

01 Achieves new state-of-the-art results on multiple 3D-VL benchmarks.

02 Supports flexible inference with different 3D scene representations.

03 Demonstrates significant performance improvements over existing methods.

Abstract

A unified model for 3D vision-language (3D-VL) understanding is expected to take various scene representations and perform a wide range of tasks in a 3D scene. However, a considerable gap exists between existing methods and such a unified model, due to the independent application of representation and insufficient exploration of 3D multi-task training. In this paper, we introduce PQ3D, a unified model capable of using Promptable Queries to tackle a wide range of 3D-VL tasks, from low-level instance segmentation to high-level reasoning and planning. This is achieved through three key innovations: (1) unifying various 3D scene representations (i.e., voxels, point clouds, multi-view images) into a shared 3D coordinate space by segment-level grouping, (2) an attention-based query decoder for task-specific information retrieval guided by prompts, and (3) universal output heads for different…
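The first two ideas in the abstract, segment-level grouping and prompt-guided attention, can be illustrated with a small sketch. This is not the PQ3D implementation: the function names `pool_by_segment` and `cross_attend`, the average pooling, the single attention head, and the toy 2-dimensional features are all assumptions made for illustration. It shows features from two representations being pooled onto shared segment ids, and one query vector retrieving information from the pooled segment features.

```python
import math
from collections import defaultdict


def pool_by_segment(features, segment_ids):
    """Average the feature vectors that share a segment id (assumed pooling)."""
    sums = {}
    counts = defaultdict(int)
    for feat, seg in zip(features, segment_ids):
        if seg not in sums:
            sums[seg] = [0.0] * len(feat)
        sums[seg] = [s + f for s, f in zip(sums[seg], feat)]
        counts[seg] += 1
    return {seg: [s / counts[seg] for s in total] for seg, total in sums.items()}


def cross_attend(query, keys, values):
    """Single-head scaled dot-product attention of one query over segment features."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


# Toy inputs (hypothetical): three point features and two voxel features,
# each tagged with the id of the segment it falls into.
point_segments = pool_by_segment([[1.0, 0.0], [3.0, 0.0], [0.0, 2.0]], [0, 0, 1])
voxel_segments = pool_by_segment([[0.0, 4.0], [2.0, 2.0]], [0, 1])

# Both representations now live on the same segment ids, so a query can
# attend over all of them at once (an assumed simplification).
segment_feats = [point_segments[0], point_segments[1],
                 voxel_segments[0], voxel_segments[1]]
prompt_query = [0.0, 0.0]  # stand-in for a prompt-conditioned query vector
retrieved = cross_attend(prompt_query, segment_feats, segment_feats)
```

Because both representations are indexed by the same segment ids after pooling, dropping one of them at inference time only shortens the list the query attends over, which is one way to read the flexible-inference finding above.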

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Constraint Satisfaction and Optimization