3D-MoE: A Mixture-of-Experts Multi-modal LLM for 3D Vision and Pose Diffusion via Rectified Flow

Yueen Ma; Yuzheng Zhuang; Jianye Hao; Irwin King

arXiv:2501.16698 · cs.CL · January 29, 2025

Open Access

TL;DR

This paper introduces 3D-MoE, a multi-modal large language model with a mixture-of-experts architecture and a diffusion head, enhancing 3D vision and pose reasoning with improved efficiency and performance.

Contribution

The paper converts existing densely activated LLMs into MoE models for multi-modal 3D data processing and attaches a diffusion head, Pose-DiT, for embodied task planning.
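To make the dense-to-MoE conversion concrete, here is a minimal sketch of one common way to do it: copy a dense feed-forward block into several experts and add a newly initialized top-k router. This is an illustration under stated assumptions, not the paper's procedure. The function names (`dense_ffn`, `upcycle`, `moe_ffn`), the ReLU block, the layer sizes, and the router initialization are all hypothetical.

```python
import copy
import math
import random


def dense_ffn(x, w_in, w_out):
    """Two-layer feed-forward block with ReLU, written with plain lists."""
    hidden = [max(0.0, sum(xi * w for xi, w in zip(x, row))) for row in w_in]
    return [sum(h * w for h, w in zip(hidden, row)) for row in w_out]


def upcycle(w_in, w_out, num_experts, seed=0):
    """Copy one dense FFN into `num_experts` experts and add a fresh router.

    Assumption: experts start as exact copies of the dense weights; the
    source does not say how its experts or router are initialized.
    """
    rng = random.Random(seed)
    d_model = len(w_in[0])
    experts = [(copy.deepcopy(w_in), copy.deepcopy(w_out)) for _ in range(num_experts)]
    router = [[rng.gauss(0.0, 0.02) for _ in range(d_model)] for _ in range(num_experts)]
    return experts, router


def moe_ffn(x, experts, router, top_k=2):
    """Route x to the top_k experts and mix their outputs with softmax gates."""
    logits = [sum(xi * w for xi, w in zip(x, row)) for row in router]
    chosen = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    peak = max(logits[i] for i in chosen)
    weights = [math.exp(logits[i] - peak) for i in chosen]
    total = sum(weights)
    out = [0.0] * len(experts[0][1])
    for i, w in zip(chosen, weights):
        y = dense_ffn(x, *experts[i])  # only the chosen experts are evaluated
        out = [o + (w / total) * yi for o, yi in zip(out, y)]
    return out, chosen


# Toy sizes, chosen only for illustration.
rng = random.Random(1)
d_model, d_hidden = 4, 8
w_in = [[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(d_hidden)]
w_out = [[rng.gauss(0, 1) for _ in range(d_hidden)] for _ in range(d_model)]
x = [rng.gauss(0, 1) for _ in range(d_model)]

experts, router = upcycle(w_in, w_out, num_experts=4)
dense_out = dense_ffn(x, w_in, w_out)
moe_out, chosen = moe_ffn(x, experts, router, top_k=2)
```

In this sketch the MoE output matches the dense output immediately after conversion, because every expert is still a copy and the gates sum to one. Only 2 of the 4 experts run per input, which is the sense in which an MoE layer activates fewer parameters than its total size.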

Findings

1. Improved performance on 3D question answering

2. Fewer activated parameters for comparable results

3. Effective multi-modal 3D reasoning

Abstract

3D vision and spatial reasoning have long been recognized as preferable for accurately perceiving our three-dimensional world, especially when compared with traditional visual reasoning based on 2D images. Due to the difficulties in collecting high-quality 3D data, research in this area has only recently gained momentum. With the advent of powerful large language models (LLMs), multi-modal LLMs for 3D vision have been developed over the past few years. However, most of these models focus primarily on the vision encoder for 3D data. In this paper, we propose converting existing densely activated LLMs into mixture-of-experts (MoE) models, which have proven effective for multi-modal data processing. In addition to leveraging these models' instruction-following capabilities, we further enable embodied task planning by attaching a diffusion head, Pose-DiT, that employs a novel rectified flow…
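The abstract is cut off before it describes Pose-DiT's rectified flow formulation, so the following is only a sketch of standard rectified flow, not the paper's variant. It shows the straight-line interpolation between noise and data, the constant velocity target a network would regress to, and Euler sampling. The oracle velocity field stands in for a trained network, and `target_pose` is a hypothetical 3-D position used purely for illustration.

```python
import random


def interpolate(x0, x1, t):
    """Point on the straight path between noise x0 (t=0) and data x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Regression target for a velocity network: the constant x1 - x0."""
    return [b - a for a, b in zip(x0, x1)]


def oracle_velocity(x, t, x1):
    """Exact velocity field when the data is a single point x1.

    Assumption: this closed form replaces a trained network for the demo.
    """
    return [(b - a) / (1.0 - t) for a, b in zip(x, x1)]


def euler_sample(x0, velocity_fn, num_steps=10):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k / num_steps  # largest t is (num_steps - 1) / num_steps, so 1 - t > 0
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


rng = random.Random(0)
target_pose = [0.3, -0.1, 0.5]  # hypothetical pose, illustration only
noise = [rng.gauss(0.0, 1.0) for _ in target_pose]

midpoint = interpolate(noise, target_pose, 0.5)
sample = euler_sample(noise, lambda x, t: oracle_velocity(x, t, target_pose))
```

Because the path is a straight line, the velocity along it is constant, so even a coarse Euler integration lands on the target in this toy setting. With a learned velocity field and a real data distribution the paths are only approximately straight, and the number of steps becomes a quality and speed trade-off.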

Taxonomy

Topics: Advanced Vision and Imaging · Image and Object Detection Techniques · Robot Manipulation and Learning

Methods: Diffusion · Focus