Advancing Multimodal LLMs by Large-Scale 3D Visual Instruction Dataset Generation

Liu He; Xiao Zeng; Yizhi Song; Albert Y. C. Chen; Lu Xia; Shashwat Verma; Sankalp Dayal; Min Sun; Cheng-Hao Kuo; Daniel Aliaga

arXiv:2507.08513·cs.GR·July 25, 2025

TL;DR

This paper introduces Ultimate3D, a large-scale synthetic 3D visual instruction dataset that improves multimodal large language models' understanding of camera-object relations. Models fine-tuned on it outperform commercial ones by 33.4% in accuracy.

Contribution

We develop a novel synthetic data generation pipeline for 3D visual instructions, creating the Ultimate3D dataset and benchmark to enhance MLLMs' capabilities.

Findings

01

MLLMs fine-tuned on our dataset outperform commercial models by 33.4% in accuracy.

02

The dataset includes 240K VQAs with detailed camera-object annotations.

03

Our approach improves recognition of camera-object relations, in particular object orientation, camera viewpoint, and camera shots.

Abstract

Multimodal Large Language Models (MLLMs) struggle to accurately capture camera-object relations, especially object orientation, camera viewpoint, and camera shots. This is because existing MLLMs are trained on images with limited diversity in camera-object relations and corresponding textual descriptions. To address this, we propose a synthetic generation pipeline to create large-scale 3D visual instruction datasets. Our framework takes 3D assets as input and uses rendering and diffusion-based image generation models to create photorealistic images that preserve precise camera-object relations. Additionally, large language models (LLMs) are used to generate text prompts for guiding visual instruction tuning and controlling image generation. We create Ultimate3D, a dataset of 240K VQAs with precise camera-object annotations, and a corresponding benchmark. MLLMs fine-tuned on…
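To make the idea of VQAs with camera-object annotations concrete, the following is a minimal sketch of how known render-time camera parameters could be turned into question-answer pairs. It is not the authors' pipeline: the `CameraPose` fields, the label names, the angle and distance thresholds, and the question templates are all illustrative assumptions, and the rendering and diffusion stages are omitted entirely.

```python
from dataclasses import dataclass


@dataclass
class CameraPose:
    """Hypothetical camera parameters known at render time (names are assumptions)."""
    azimuth_deg: float    # angle around the object; 0 means the camera sees the object's front
    elevation_deg: float  # angle above (+) or below (-) the horizontal plane
    distance: float       # camera-to-object distance, in multiples of object height


# Assumed label set: eight 45-degree bins describing which side of the object is visible.
ORIENTATIONS = [
    "front", "front-left", "left", "back-left",
    "back", "back-right", "right", "front-right",
]


def orientation_label(azimuth_deg: float) -> str:
    """Map an azimuth to one of eight bins, each centered on a multiple of 45 degrees."""
    idx = int(((azimuth_deg % 360) + 22.5) // 45) % 8
    return ORIENTATIONS[idx]


def viewpoint_label(elevation_deg: float) -> str:
    """Map an elevation to a coarse viewpoint class (thresholds are assumptions)."""
    if elevation_deg < -15:
        return "low-angle"
    if elevation_deg <= 15:
        return "eye-level"
    return "high-angle"


def shot_label(distance: float) -> str:
    """Map a normalized distance to a shot type (thresholds are assumptions)."""
    if distance < 1.5:
        return "close-up"
    if distance < 4.0:
        return "medium shot"
    return "long shot"


def make_vqas(pose: CameraPose, category: str) -> list[dict]:
    """Build question-answer records whose answers come directly from the camera pose."""
    return [
        {"question": f"Which side of the {category} faces the camera?",
         "answer": orientation_label(pose.azimuth_deg)},
        {"question": f"From what viewpoint is the {category} seen?",
         "answer": viewpoint_label(pose.elevation_deg)},
        {"question": f"What type of camera shot shows the {category}?",
         "answer": shot_label(pose.distance)},
    ]


if __name__ == "__main__":
    for record in make_vqas(CameraPose(90.0, 30.0, 1.0), "chair"):
        print(record["question"], "->", record["answer"])
```

Because the answers are derived from the parameters used to place the camera, they stay exact regardless of how the final image is produced, which is the property the abstract refers to as preserving precise camera-object relations.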
