Argus: Leveraging Multiview Images for Improved 3-D Scene Understanding With Large Language Models

Yifan Xu; Chao Zhang; Hanqi Jiang; Xiaoyan Wang; Ruifei Ma; Yiwei Li; Zihao Wu; Zeju Li; Xiangde Liu

arXiv:2507.12916·cs.CV·July 18, 2025

TL;DR

Argus introduces a multimodal framework that combines multi-view images, 3D point clouds, and text instructions to significantly improve 3D scene understanding capabilities of large language models.

Contribution

Argus is the first to integrate multi-view images with 3D point clouds and LLMs, creating a comprehensive 3D multimodal foundation model for enhanced scene understanding.

Findings

01. Outperforms existing 3D-LMMs in downstream tasks

02. Effectively compensates for information loss in 3D reconstructions

03. Enhances LLM understanding of complex 3D scenes

Abstract

Advances in foundation models have enabled applications across a wide range of downstream tasks. In particular, Large Language Models (LLMs) have recently been extended, with remarkable results, to tasks in 3D scene understanding. Current methods rely heavily on 3D point clouds, but reconstructing the 3D point cloud of an indoor scene often loses information. Textureless planes and repetitive patterns are prone to omission and appear as voids in the reconstructed point cloud. In addition, objects with complex structures tend to suffer distorted details because of misalignments between the captured images and the dense reconstructed point cloud. 2D multi-view images are visually consistent with 3D point clouds and provide more detailed representations of scene components, so they can naturally compensate for these deficiencies. Based…
