See&Say: Vision Language Guided Safe Zone Detection for Autonomous Package Delivery Drones

Mahyar Ghazanfari; Peng Wei

arXiv:2604.13292·cs.CV·April 16, 2026

TL;DR

See&Say is a novel framework that combines geometric cues with semantic perception, guided by a Vision-Language Model, to improve safe zone detection for autonomous drone package delivery in complex environments.

Contribution

The paper introduces See&Say, integrating geometric safety cues with semantic reasoning via a Vision-Language Model for robust, iterative hazard detection and zone identification in drone delivery.
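The iterative part of this contribution can be pictured as a loop that grows the set of hazard categories until nothing new is proposed. The sketch below is illustrative only and is not the authors' implementation: `refine_prompts`, `fake_vlm`, the category names, and the `max_rounds` limit are all hypothetical, and a fixed lookup table stands in for a real Vision-Language Model call.

```python
from typing import Callable, List, Set


def refine_prompts(
    initial: List[str],
    suggest: Callable[[Set[str]], Set[str]],
    max_rounds: int = 3,
) -> Set[str]:
    """Grow the hazard category set until the suggester adds nothing new."""
    prompts = set(initial)
    for _ in range(max_rounds):
        # Keep only categories that are not already being detected.
        new = suggest(prompts) - prompts
        if not new:
            break
        prompts |= new
    return prompts


def fake_vlm(current: Set[str]) -> Set[str]:
    """Hypothetical stand-in for a VLM: a fixed lookup, not a real model."""
    table = {"person": {"dog"}, "dog": {"leash"}}
    out: Set[str] = set()
    for category in current:
        out |= table.get(category, set())
    return out


result = refine_prompts(["person", "car"], fake_vlm)
print(sorted(result))
```

Passing the suggester in as a callable keeps the loop independent of any particular model or detector, so the stopping rule can be tested without one.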

Findings

01 Outperforms baselines in safety map accuracy and IoU.

02 Successfully identifies alternative delivery zones when primary zones are unsafe.

03 Demonstrates effectiveness in dynamic urban delivery scenarios.
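For reference, IoU (intersection over union) between a predicted and a ground-truth binary safety map can be computed as in the sketch below. The function name `mask_iou`, the toy arrays, and the convention of returning 1.0 when both masks are empty are assumptions made for illustration; the page does not describe the paper's evaluation code.

```python
import numpy as np


def mask_iou(pred, truth) -> float:
    """Intersection over union of two binary masks of the same shape."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    union = np.logical_or(pred, truth).sum()
    if union == 0:
        # Assumed convention: two empty masks agree perfectly.
        return 1.0
    return float(np.logical_and(pred, truth).sum() / union)


pred = np.array([[1, 1], [0, 0]])
truth = np.array([[1, 0], [0, 0]])
iou = mask_iou(pred, truth)  # one shared pixel out of two covered pixels
```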

Abstract

Autonomous drone delivery systems are rapidly advancing, but ensuring safe and reliable package drop-offs remains highly challenging in cluttered urban and suburban environments where accurately identifying suitable package drop zones is critical. Existing approaches typically rely on either geometry-based analysis or semantic segmentation alone, but these methods lack the integrated semantic reasoning required for robust decision-making. To address this gap, we propose See&Say, a novel framework that combines geometric safety cues with semantic perception, guided by a Vision-Language Model (VLM) for iterative refinement. The system fuses monocular depth gradients with open-vocabulary detection masks to produce safety maps, while the VLM dynamically adjusts object category prompts and refines hazard detection across time, enabling reliable reasoning under dynamic conditions during the…
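The fusion step described in the abstract can be sketched as follows, assuming a simple rule: a pixel counts as safe when the local depth gradient is small and no detected hazard covers it. The function `safety_map`, the threshold `grad_thresh`, and the toy scene are hypothetical; the abstract does not specify the actual fusion rule, and a real system would take the depth map from a monocular depth estimator and the mask from an open-vocabulary detector.

```python
import numpy as np


def safety_map(depth, hazard_mask, grad_thresh=0.05):
    """Return a boolean map: True where a pixel is flat and hazard-free."""
    # Finite-difference depth gradients along rows (dy) and columns (dx).
    dy, dx = np.gradient(np.asarray(depth, dtype=float))
    grad_mag = np.hypot(dx, dy)
    flat = grad_mag < grad_thresh
    return flat & ~np.asarray(hazard_mask, dtype=bool)


# Toy scene: flat ground with a depth step (for example a curb) at column 4,
# plus one detected hazard in the top-left corner.
depth = np.ones((6, 6))
depth[:, 4:] = 2.0
hazard = np.zeros((6, 6), dtype=bool)
hazard[0:2, 0:2] = True

safe = safety_map(depth, hazard)
```

Here the pixels next to the depth step and the pixels under the hazard mask are marked unsafe, while the remaining flat ground stays safe.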
