DeepSight: Bridging Depth Maps and Language with a Depth-Driven Multimodal Model
Hao Yang, Hongbo Zhang, Yanyan Zhao, Bing Qin

TL;DR
DeepSight is a multimodal large language model designed to improve 3D scene understanding by integrating depth maps with language, using new depth datasets and a modified encoder architecture.
Contribution
The paper introduces a dedicated depth-focused multimodal model, two new datasets (depth image-text pairs and depth instructions), and a modified encoder that better captures depth information.
Findings
DeepSight significantly improves depth perception accuracy.
DeepSight improves performance on depth-related visual question answering.
Integrating depth maps effectively boosts 3D scene understanding.
Abstract
Multimodal large language models (MLLMs) have achieved impressive performance across various tasks such as image captioning and visual question answering (VQA); however, they often struggle to accurately interpret depth information inherent in visual data. In this work, we introduce DeepSight, the first dedicated depth MLLM designed to enhance three-dimensional scene understanding. Unlike conventional methods that align RGB image encodings with text, our approach takes advantage of the unique characteristics of depth images (single-channel grayscale images whose pixel values directly reflect depth cues) to improve spatial reasoning. To address challenges associated with limited depth data and the inadequacy of simple channel replication, we construct a novel depth image-text pair dataset and a depth instruction dataset. Depth maps are generated from visual images using the GLPN model,…
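To make the "simple channel replication" baseline mentioned in the abstract concrete, the sketch below scales a small depth map to 8-bit grayscale and copies that single channel three times so it resembles an RGB input. This only illustrates the naive baseline the abstract calls inadequate; it is not DeepSight's modified encoder. The function names, the min-max normalization, and the sample values are assumptions made for illustration.

```python
def normalize_depth(depth):
    """Min-max scale a 2D list of depth values to integers in 0..255."""
    flat = [v for row in depth for v in row]
    lo, hi = min(flat), max(flat)
    span = hi - lo
    if span == 0:
        # A constant depth map carries no contrast; map everything to 0.
        return [[0 for _ in row] for row in depth]
    return [[round(255 * (v - lo) / span) for v in row] for row in depth]


def replicate_channels(gray, channels=3):
    """Naive baseline: copy the one depth channel into `channels` channels (H x W x C)."""
    return [[[v] * channels for v in row] for row in gray]


# Hypothetical 2x2 depth map (values in arbitrary depth units).
depth_map = [[0.5, 1.0], [2.0, 4.5]]
gray = normalize_depth(depth_map)
rgb_like = replicate_channels(gray)

print(gray)            # [[0, 32], [96, 255]]
print(rgb_like[1][0])  # [96, 96, 96]
```

Every pixel in `rgb_like` holds three identical values, so the replicated input has the shape an RGB encoder expects but carries no more information than the single grayscale channel.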
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning
