VisionGPT-3D: A Generalized Multimodal Agent for Enhanced 3D Vision Understanding

Chris Kelly; Luhui Hu; Jiayin Hu; Yu Tian; Deshun Yang; Bang Yang; Cindy Yang; Zihao Li; Zaoshan Huang; Yuexian Zou

arXiv:2403.09530·cs.CV·March 25, 2024·1 cites

Open Access

TL;DR

VisionGPT-3D introduces a unified multimodal framework that integrates state-of-the-art vision models to enhance 3D vision understanding from diverse inputs like text and images.

Contribution

It consolidates multiple SOTA vision models into a single framework, automates model selection, and optimizes 3D mesh generation from multimodal data.

Findings

01. Successfully integrates various vision models for 3D understanding
02. Automates selection of optimal 3D mesh algorithms
03. Achieves improved 3D reconstruction accuracy

Abstract

The evolution from text to visual components facilitates people's daily lives, for example by generating images and videos from text and identifying desired elements within images. Earlier computer vision models with multimodal abilities focused on image detection and classification based on well-defined objects. Large language models (LLMs) introduce the transformation from natural language to visual objects, which present the visual layout for text contexts. OpenAI GPT-4 has emerged as the pinnacle of LLMs, while the computer vision (CV) domain boasts a plethora of state-of-the-art (SOTA) models and algorithms for converting 2D images to their 3D representations. However, a mismatch between the algorithm and the problem can lead to undesired results. In response to this challenge, we propose a unified VisionGPT-3D framework to consolidate the state-of-the-art…

Taxonomy

Topics: Robotic Path Planning Algorithms · Robotics and Sensor-Based Localization

Methods: Attention Is All You Need · Absolute Position Encodings · Residual Connection · Dropout · Softmax · Linear Layer · Multi-Head Attention · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Layer Normalization