Enhancing the Spatial Awareness Capability of Multi-Modal Large Language Model

Yongqiang Zhao; Zhenyu Li; Zhi Jin; Feng Zhang; Haiyan Zhao; Chengfeng Dou; Zhengwei Tao; Xinhai Xu; Donghong Liu

arXiv:2310.20357·cs.AI·November 2, 2023·2 cites

Open Access

TL;DR

This paper improves the spatial awareness of multi-modal large language models by integrating precise geometric and scene graph information, leading to better performance in spatial understanding tasks across various benchmarks.

Contribution

The paper introduces a method that incorporates detailed spatial position data and scene graphs to enhance the spatial reasoning capabilities of MLLMs.

Findings

01 Significant improvement in spatial awareness tasks

02 Enhanced performance on benchmarks like MME and MM-Vet

03 Effective integration of geometric and scene graph data

Abstract

The Multi-Modal Large Language Model (MLLM) is an extension of the Large Language Model (LLM) that can receive and reason over multi-modal data. Spatial awareness is one of the crucial abilities of an MLLM, encompassing diverse skills related to understanding spatial relationships among objects and between objects and the scene area. Industries such as autonomous driving, smart healthcare, robotics, and virtual and augmented reality heavily demand MLLM spatial awareness capabilities. However, there is a noticeable gap between the current spatial awareness capabilities of MLLMs and what human needs require. To address this issue, this paper proposes using more precise spatial position information between objects to guide the MLLM in providing more accurate responses to user-related inquiries. Specifically, for a particular multi-modal task, we utilize…
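The general idea of guiding a model with explicit spatial position information between objects can be illustrated with a small sketch. This is not the authors' pipeline, which the truncated abstract does not specify. It assumes axis-aligned bounding boxes in image pixel coordinates (y growing downward) produced by some unspecified detector, and the names `Box`, `describe_relations`, and `build_prompt` are hypothetical helpers introduced only for this example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Box:
    """Hypothetical detection result: a label plus an axis-aligned box in pixels."""
    label: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    @property
    def center(self) -> Tuple[float, float]:
        return ((self.x_min + self.x_max) / 2, (self.y_min + self.y_max) / 2)


def horizontal_relation(a: Box, b: Box) -> str:
    """Relation of a to b along the x axis, judged by box centers."""
    ax, bx = a.center[0], b.center[0]
    if ax < bx:
        return "left of"
    if ax > bx:
        return "right of"
    return "horizontally aligned with"


def vertical_relation(a: Box, b: Box) -> str:
    """Relation of a to b along the y axis (image coordinates, y grows downward)."""
    ay, by = a.center[1], b.center[1]
    if ay < by:
        return "above"
    if ay > by:
        return "below"
    return "vertically aligned with"


def describe_relations(boxes: List[Box]) -> List[str]:
    """One plain-text statement per unordered pair of objects."""
    lines = []
    for i, a in enumerate(boxes):
        for b in boxes[i + 1:]:
            lines.append(
                f"{a.label} is {horizontal_relation(a, b)} "
                f"and {vertical_relation(a, b)} {b.label}"
            )
    return lines


def build_prompt(question: str, boxes: List[Box]) -> str:
    """Prepend the spatial statements to the user's question (assumed prompt layout)."""
    facts = "\n".join(f"- {line}" for line in describe_relations(boxes))
    return f"Spatial facts:\n{facts}\n\nQuestion: {question}"


if __name__ == "__main__":
    detected = [
        Box("cup", 10, 20, 50, 60),
        Box("laptop", 100, 40, 300, 200),
    ]
    print(build_prompt("Where is the cup relative to the laptop?", detected))
```

Comparing box centers is the simplest possible choice here; a fuller scene graph could also encode overlap, containment, or distance, but those are beyond what this sketch attempts.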

Taxonomy

Topics: Multimodal Machine Learning Applications · Geographic Information Systems Studies · Topic Modeling

Methods: Sparse Evolutionary Training