HMR3D: Hierarchical Multimodal Representation for 3D Scene Understanding with Large Vision-Language Model

Chen Li; Eric Peh; Basura Fernando

arXiv:2511.22961·cs.CV·December 1, 2025

Open Access

TL;DR

This paper introduces HMR3D, a hierarchical multimodal approach that explicitly aligns multi-view images and text descriptions with large vision-language models for improved 3D scene understanding.

Contribution

We propose a novel hierarchical multimodal representation for 3D scene reasoning that aligns with VLMs explicitly at the input space, using multi-view images and spatially descriptive text rather than implicitly aligned 3D scene features.

Findings

01. Outperforms existing methods on 3D Q&A benchmarks.

02. Effectively captures local and global scene context.

03. Improves reasoning over complex 3D environments.

Abstract

Recent advances in large vision-language models (VLMs) have shown significant promise for 3D scene understanding. Existing VLM-based approaches typically align 3D scene features with the VLM's embedding space. However, this implicit alignment often yields suboptimal performance due to the scarcity of 3D data and the inherent complexity of spatial relationships in 3D environments. To address these limitations, we propose a novel hierarchical multimodal representation for 3D scene reasoning that explicitly aligns with VLMs at the input space by leveraging both multi-view images and text descriptions. The text descriptions capture spatial relationships by referencing the 3D coordinates of detected objects, while the multi-view images include a top-down perspective and four directional views (forward, left, right, and backward), ensuring comprehensive scene coverage. Additionally, we…
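
The input-space representation described in the abstract can be illustrated with a rough sketch. This is not the authors' implementation: the `DetectedObject` type, the `describe_objects` and `build_prompt` helpers, the text template, and the coordinate convention (metres in a shared scene frame) are all assumptions made for illustration. Only the five view names and the idea of referencing the 3D coordinates of detected objects in text come from the abstract.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class DetectedObject:
    """Hypothetical container for one detected object."""
    label: str
    center: Tuple[float, float, float]  # (x, y, z); units and frame are assumed


# The five views named in the abstract: one top-down plus four directional.
VIEWS = ["top-down", "forward", "left", "right", "backward"]


def describe_objects(objects: List[DetectedObject]) -> str:
    """Render each object as one text line that references its 3D coordinates.

    The line template is an illustrative assumption, not the paper's format.
    """
    lines = []
    for i, obj in enumerate(objects):
        x, y, z = obj.center
        lines.append(f"obj{i}: {obj.label} at ({x:.2f}, {y:.2f}, {z:.2f})")
    return "\n".join(lines)


def build_prompt(objects: List[DetectedObject], question: str) -> str:
    """Assemble the text half of a VLM input: view list, object text, question.

    The images themselves would be passed to the VLM separately.
    """
    view_list = ", ".join(VIEWS)
    return (
        f"Images: {len(VIEWS)} views of the scene ({view_list}).\n"
        f"Objects:\n{describe_objects(objects)}\n"
        f"Question: {question}"
    )


if __name__ == "__main__":
    scene = [
        DetectedObject("chair", (1.0, 2.5, 0.4)),
        DetectedObject("table", (0.0, 0.0, 0.75)),
    ]
    print(build_prompt(scene, "What is next to the table?"))
```

Because both the images and the coordinate-bearing text are ordinary VLM inputs, a sketch like this needs no learned projection from 3D features into the model's embedding space, which is the contrast the abstract draws with implicit alignment.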

Taxonomy

Topics: Multimodal Machine Learning Applications · Robotics and Sensor-Based Localization · Advanced Neural Network Applications