Uni3DL: Unified Model for 3D and Language Understanding
Xiang Li, Jian Ding, Zhaoyang Chen, Mohamed Elhoseiny

TL;DR
Uni3DL is a unified 3D and language understanding model that operates directly on point clouds, supporting a wide range of tasks with a single architecture while matching or exceeding state-of-the-art performance.
Contribution
The paper introduces Uni3DL, a unified model that processes 3D point clouds directly for diverse vision and vision-language tasks, extending task coverage beyond existing models that depend on projected multi-view images.
Findings
Achieves performance comparable to or better than state-of-the-art (SOTA) models across multiple 3D tasks.
Supports a broad spectrum of tasks including segmentation, detection, and cross-modal retrieval.
Demonstrates effective task sharing and decomposition within a unified architecture.
Abstract
In this work, we present Uni3DL, a unified model for 3D and Language understanding. Distinct from existing unified vision-language models in 3D which are limited in task variety and predominantly dependent on projected multi-view images, Uni3DL operates directly on point clouds. This approach significantly expands the range of supported tasks in 3D, encompassing both vision and vision-language tasks in 3D. At the core of Uni3DL, a query transformer is designed to learn task-agnostic semantic and mask outputs by attending to 3D visual features, and a task router is employed to selectively generate task-specific outputs required for diverse tasks. With a unified architecture, our Uni3DL model enjoys seamless task decomposition and substantial parameter sharing across tasks. Uni3DL has been rigorously evaluated across diverse 3D vision-language understanding tasks, including semantic…
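The query-transformer and task-router idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the helper names (`attend`, `query_outputs`, `route`), the head definitions, and the routing table are all hypothetical, and a single dot-product attention step stands in for the full query transformer.

```python
import math
from typing import Callable, Dict, List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def attend(query: Vector, feats: List[Vector]) -> Vector:
    """One query attends to all point features (softmax-weighted average)."""
    scores = [dot(query, f) for f in feats]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(feats[0])
    return [sum(w * f[i] for w, f in zip(weights, feats)) / total
            for i in range(dim)]


def query_outputs(query: Vector, feats: List[Vector]) -> Dict[str, Vector]:
    """Task-agnostic outputs: a semantic embedding and per-point mask logits."""
    semantic = attend(query, feats)
    mask = [dot(semantic, f) for f in feats]
    return {"semantic": semantic, "mask": mask}


# Hypothetical task-specific heads that consume the shared outputs.
HEADS: Dict[str, Callable[[Dict[str, Vector]], object]] = {
    "mask": lambda out: [m > 0.0 for m in out["mask"]],
    "embedding": lambda out: out["semantic"],
}

# Hypothetical routing table: which heads each task needs.
ROUTES: Dict[str, List[str]] = {
    "segmentation": ["mask"],
    "retrieval": ["embedding"],
}


def route(task: str, out: Dict[str, Vector]) -> Dict[str, object]:
    """Task router: run only the heads registered for the requested task."""
    return {name: HEADS[name](out) for name in ROUTES[task]}


# Toy input: two points with 2-D features and one query.
point_feats = [[1.0, 0.0], [-1.0, 0.0]]
shared = query_outputs([1.0, 0.0], point_feats)
seg_result = route("segmentation", shared)
ret_result = route("retrieval", shared)
```

In this sketch both tasks reuse the same `shared` outputs, which mirrors the parameter sharing the abstract describes; only the small per-task heads differ.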
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Human Pose and Action Recognition
