Ground-Truth Depth in Vision Language Models: Spatial Context Understanding in Conversational AI for XR-Robotic Support in Emergency First Response

Rodrigo Gutierrez Maquilon; Marita Hueber; Georg Regal; Manfred Tscheligi

arXiv:2602.15237·cs.HC·February 18, 2026

TL;DR

This paper reports that integrating depth sensing with vision language models improves spatial reasoning and situational awareness in a mixed-reality emergency first response scenario, yielding more accurate and stable distance estimates without increasing workload.

Contribution

It introduces a prototype that fuses depth sensing with vision language models to provide metrically grounded spatial information for emergency response tasks.

Findings

01. Depth-augmented VLM improves distance estimation accuracy.

02. Depth-augmentation raises situational awareness without increasing workload.

03. Depth-agnostic assistance can increase workload and reduce accuracy.

Abstract

Large language models (LLMs) are increasingly used in emergency first response (EFR) applications to support situational awareness (SA) and decision-making, yet most operate on text or 2D imagery and offer little support for core EFR SA competencies like spatial reasoning. We address this gap by evaluating a prototype that fuses robot-mounted depth sensing and YOLO detection with a vision language model (VLM) capable of verbalizing metrically-grounded distances of detected objects (e.g., the chair is 3.02 meters away). In a mixed-reality toxic-smoke scenario, participants estimated distances to a victim and an exit window under three conditions: video-only, depth-agnostic VLM, and depth-augmented VLM. Depth-augmentation improved objective accuracy and stability, e.g., the victim and window distance estimation error dropped, while raising situational awareness without increasing…
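The fusion step described in the abstract, pairing each detected object with a metric distance and verbalizing it, can be illustrated with a minimal sketch. This is not the authors' pipeline: the `Detection` type, the function names, the use of the median depth inside each bounding box, and the convention that a depth of 0 marks a missing reading are all assumptions made for illustration. A real system would take the boxes from a detector such as YOLO and the depth map from the robot's sensor; here both are small hand-written values.

```python
from dataclasses import dataclass
from statistics import median
from typing import List, Optional, Sequence


@dataclass
class Detection:
    """Hypothetical detector output: a class label and a pixel-space box."""
    label: str
    x_min: int
    y_min: int
    x_max: int  # exclusive
    y_max: int  # exclusive


def box_distance_m(depth_m: Sequence[Sequence[float]],
                   det: Detection) -> Optional[float]:
    """Return the median valid depth (meters) inside the box, or None."""
    values = [
        d
        for row in depth_m[det.y_min:det.y_max]
        for d in row[det.x_min:det.x_max]
        if d > 0.0  # assumption: 0 marks a missing depth reading
    ]
    return median(values) if values else None


def describe_scene(depth_m: Sequence[Sequence[float]],
                   detections: List[Detection]) -> str:
    """Build one sentence per detection, suitable as context for a VLM prompt."""
    lines = []
    for det in detections:
        dist = box_distance_m(depth_m, det)
        if dist is None:
            lines.append(f"The {det.label} has no valid depth reading.")
        else:
            lines.append(f"The {det.label} is {dist:.2f} meters away.")
    return "\n".join(lines)


# Toy 4x6 depth map in meters; zeros are missing readings.
depth = [
    [3.00, 3.02, 3.04, 0.0, 5.5, 5.5],
    [3.01, 3.02, 3.03, 0.0, 5.5, 5.6],
    [0.0, 3.02, 3.02, 0.0, 5.4, 5.5],
    [9.0, 9.0, 9.0, 0.0, 0.0, 0.0],
]
detections = [
    Detection("chair", 0, 0, 3, 3),
    Detection("window", 4, 0, 6, 3),
    Detection("door", 3, 0, 4, 4),
]
scene_text = describe_scene(depth, detections)
```

The median is used here because it tolerates background pixels and sensor dropouts inside a box better than the mean would; whether the prototype aggregates depth this way is not stated in the abstract. The resulting `scene_text` could be prepended to a VLM prompt so that its answers about distance are grounded in measured values.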

Taxonomy

Topics: Multimodal Machine Learning Applications · Human-Automation Interaction and Safety · Social Robot Interaction and HRI