HMR3D: Hierarchical Multimodal Representation for 3D Scene Understanding with Large Vision-Language Model
Chen Li, Eric Peh, Basura Fernando

TL;DR
This paper introduces HMR3D, a hierarchical multimodal approach that explicitly aligns multi-view images and text descriptions with large vision-language models for improved 3D scene understanding.
Contribution
We propose a novel hierarchical multimodal representation that explicitly aligns 3D scenes with VLMs at the input space, using multi-view images and spatially descriptive text.
Findings
Outperforms existing methods on 3D Q&A benchmarks.
Effectively captures local and global scene context.
Improves reasoning over complex 3D environments.
Abstract
Recent advances in large vision-language models (VLMs) have shown significant promise for 3D scene understanding. Existing VLM-based approaches typically align 3D scene features with the VLM's embedding space. However, this implicit alignment often yields suboptimal performance due to the scarcity of 3D data and the inherent complexity of spatial relationships in 3D environments. To address these limitations, we propose a novel hierarchical multimodal representation for 3D scene reasoning that explicitly aligns with VLMs at the input space by leveraging both multi-view images and text descriptions. The text descriptions capture spatial relationships by referencing the 3D coordinates of detected objects, while the multi-view images include a top-down perspective and four directional views (forward, left, right, and backward), ensuring comprehensive scene coverage. Additionally, we…
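As a rough illustration of the input-space idea described in the abstract, the sketch below turns detected objects and their 3D coordinates into a text description and pairs it with placeholders for the five views (top-down, forward, left, right, backward). It is a minimal sketch under stated assumptions, not the authors' implementation: the `DetectedObject` class, the `describe_scene` and `build_prompt` helpers, the `<image:...>` placeholder syntax, and the sentence template are all hypothetical, and the abstract does not specify the actual prompt format.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical container for one detected object; the paper's real
# detector output format is not given in the abstract.
@dataclass
class DetectedObject:
    label: str
    center: Tuple[float, float, float]  # assumed (x, y, z) scene coordinates

# The five views named in the abstract.
VIEWS = ("top-down", "forward", "left", "right", "backward")

def describe_scene(objects: List[DetectedObject]) -> str:
    """Write one line per object that references its 3D coordinates."""
    lines = []
    for i, obj in enumerate(objects):
        x, y, z = obj.center
        lines.append(f"Object {i}: {obj.label} at ({x:.2f}, {y:.2f}, {z:.2f})")
    return "\n".join(lines)

def build_prompt(objects: List[DetectedObject], question: str) -> str:
    """Combine view placeholders, the text description, and a question.

    The "<image:NAME>" tags are an assumed stand-in for however a given
    VLM expects image inputs to be interleaved with text.
    """
    view_tags = " ".join(f"<image:{name}>" for name in VIEWS)
    return f"{view_tags}\n{describe_scene(objects)}\nQuestion: {question}"

if __name__ == "__main__":
    scene = [
        DetectedObject("chair", (1.0, 2.5, 0.45)),
        DetectedObject("table", (1.2, 3.0, 0.4)),
    ]
    print(build_prompt(scene, "What is closest to the chair?"))
```

Keeping both the images and the coordinates in a form the VLM already accepts is the point of the sketch: no separate 3D encoder has to be aligned to the model's embedding space.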
Taxonomy
Topics: Multimodal Machine Learning Applications · Robotics and Sensor-Based Localization · Advanced Neural Network Applications
