Enhancing the Spatial Awareness Capability of Multi-Modal Large Language Model
Yongqiang Zhao, Zhenyu Li, Zhi Jin, Feng Zhang, Haiyan Zhao, Chengfeng Dou, Zhengwei Tao, Xinhai Xu, Donghong Liu

TL;DR
This paper improves the spatial awareness of multi-modal large language models by integrating precise geometric and scene graph information, leading to better performance in spatial understanding tasks across various benchmarks.
Contribution
The paper introduces a method that incorporates detailed spatial position data and scene graphs to enhance MLLM's spatial reasoning capabilities.
Findings
Significant improvement in spatial awareness tasks
Enhanced performance on benchmarks like MME and MM-Vet
Effective integration of geometric and scene graph data
Abstract
The Multi-Modal Large Language Model (MLLM) is an extension of the Large Language Model (LLM) equipped with the capability to receive and reason over multi-modal data. Spatial awareness is one of the crucial abilities of an MLLM, encompassing diverse skills related to understanding spatial relationships among objects and between objects and the scene area. Industries such as autonomous driving, smart healthcare, robotics, and virtual and augmented reality heavily demand spatial awareness from MLLMs. However, a noticeable gap remains between the current spatial awareness capabilities of MLLMs and the requirements set by human needs. To address this issue, this paper proposes using more precise spatial position information between objects to guide the MLLM in providing more accurate responses to user-related inquiries. Specifically, for a particular multi-modal task, we utilize…
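The abstract is cut off before it describes the pipeline, so the following is only a minimal sketch of the general idea it states: turning object positions into explicit spatial facts and passing them to the model alongside the question. It assumes that bounding boxes are already available from some detector, and every name in it (`Box`, `spatial_relation`, `relation_triples`, `build_prompt`) is hypothetical and not taken from the paper.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box in image coordinates (y grows downward)."""
    label: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    @property
    def center(self):
        return ((self.x_min + self.x_max) / 2, (self.y_min + self.y_max) / 2)


def spatial_relation(a, b):
    """Coarse relation of box a relative to box b, based on box centers.

    Assumption: the dominant axis of the center offset decides the relation.
    This is an illustrative heuristic, not the paper's method.
    """
    ax, ay = a.center
    bx, by = b.center
    dx, dy = ax - bx, ay - by
    if abs(dx) >= abs(dy):
        return "left of" if dx < 0 else "right of"
    return "above" if dy < 0 else "below"


def relation_triples(boxes):
    """Scene-graph-style (subject, relation, object) triples for each pair."""
    return [(a.label, spatial_relation(a, b), b.label)
            for a, b in combinations(boxes, 2)]


def build_prompt(question, boxes):
    """Prepend object positions and pairwise relations to the user question."""
    objects = [
        f"{b.label}: box=({b.x_min:g}, {b.y_min:g}, {b.x_max:g}, {b.y_max:g})"
        for b in boxes
    ]
    facts = [f"{s} is {r} {o}" for s, r, o in relation_triples(boxes)]
    return "\n".join(
        ["Objects:", *objects, "Relations:", *facts, f"Question: {question}"]
    )


if __name__ == "__main__":
    # Hypothetical detections for a desk scene.
    scene = [
        Box("cup", 10, 50, 30, 70),
        Box("laptop", 60, 40, 120, 90),
        Box("lamp", 80, 0, 100, 20),
    ]
    print(build_prompt("Where is the cup relative to the laptop?", scene))
```

The resulting text block would be supplied to an MLLM together with the image. Center-based relations are deliberately coarse; a fuller treatment would also need to handle overlap, containment, and depth, none of which this sketch attempts.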
Taxonomy
Topics: Multimodal Machine Learning Applications · Geographic Information Systems Studies · Topic Modeling
Methods: Sparse Evolutionary Training
