Uni3DL: Unified Model for 3D and Language Understanding

Xiang Li; Jian Ding; Zhaoyang Chen; Mohamed Elhoseiny

arXiv:2312.03026 · cs.CV · December 7, 2023 · 1 cites

Open Access

TL;DR

Uni3DL is a unified 3D and language understanding model that operates directly on point clouds. A single architecture covers a wide range of tasks, with performance comparable or superior to state-of-the-art models.

Contribution

The paper introduces Uni3DL, a unified model that processes 3D point clouds directly for diverse vision and vision-language tasks, extending the range of supported tasks beyond existing unified models that depend on projected multi-view images.

Findings

01. Achieves performance comparable or superior to SOTA models across multiple 3D tasks.

02. Supports a broad spectrum of tasks including segmentation, detection, and cross-modal retrieval.

03. Demonstrates effective task sharing and decomposition within a unified architecture.

Abstract

In this work, we present Uni3DL, a unified model for 3D and Language understanding. Distinct from existing unified vision-language models in 3D which are limited in task variety and predominantly dependent on projected multi-view images, Uni3DL operates directly on point clouds. This approach significantly expands the range of supported tasks in 3D, encompassing both vision and vision-language tasks in 3D. At the core of Uni3DL, a query transformer is designed to learn task-agnostic semantic and mask outputs by attending to 3D visual features, and a task router is employed to selectively generate task-specific outputs required for diverse tasks. With a unified architecture, our Uni3DL model enjoys seamless task decomposition and substantial parameter sharing across tasks. Uni3DL has been rigorously evaluated across diverse 3D vision-language understanding tasks, including semantic…
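
The query-transformer and task-router idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the attention step has no learned projections, and the function names, the `TASK_ROUTES` table, and the choice of which outputs each task consumes are all hypothetical assumptions made for illustration. It only shows the general pattern of computing shared, task-agnostic outputs once and letting a router pick what each task needs.

```python
import math

# Hypothetical routing table (an assumption, not from the paper):
# which shared outputs each task consumes.
TASK_ROUTES = {
    "segmentation": ("semantic", "mask"),
    "detection": ("semantic", "mask"),
    "retrieval": ("semantic",),
}


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, point_feats):
    """Simplified cross-attention: each query becomes a softmax-weighted
    average of the point features (no learned projections)."""
    scale = math.sqrt(len(point_feats[0]))
    updated = []
    for q in queries:
        weights = softmax([dot(q, p) / scale for p in point_feats])
        updated.append([
            sum(w * p[d] for w, p in zip(weights, point_feats))
            for d in range(len(q))
        ])
    return updated


def shared_outputs(queries, point_feats):
    """Compute task-agnostic outputs once for all tasks."""
    attended = cross_attend(queries, point_feats)
    return {
        # One embedding per query.
        "semantic": attended,
        # Per-query, per-point mask logits.
        "mask": [[dot(q, p) for p in point_feats] for q in attended],
    }


def route(task, outputs):
    """Toy task router: return only the outputs the given task needs."""
    return {name: outputs[name] for name in TASK_ROUTES[task]}


# Toy inputs: three points with 2-D features, two queries.
point_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
queries = [[1.0, 0.0], [0.0, 1.0]]
outputs = shared_outputs(queries, point_feats)
print(sorted(route("retrieval", outputs)))
```

Because every task reads from the same `outputs` dictionary, the expensive attention step runs once and is shared, which is the kind of parameter sharing across tasks the abstract refers to.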

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Human Pose and Action Recognition