See&Say: Vision Language Guided Safe Zone Detection for Autonomous Package Delivery Drones
Mahyar Ghazanfari, Peng Wei

TL;DR
See&Say is a novel framework that combines geometric cues with semantic perception, guided by a Vision-Language Model, to improve safe zone detection for autonomous drone package delivery in complex environments.
Contribution
The paper introduces See&Say, integrating geometric safety cues with semantic reasoning via a Vision-Language Model for robust, iterative hazard detection and zone identification in drone delivery.
Findings
Outperforms baselines in safety map accuracy and IoU.
Successfully identifies alternative delivery zones when primary zones are unsafe.
Demonstrates effectiveness in dynamic urban delivery scenarios.
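For readers unfamiliar with the IoU metric named in the findings, the following is a minimal sketch of intersection-over-union between a predicted binary safety mask and a reference mask. The function name `mask_iou`, the list-of-rows mask representation, and the toy masks are assumptions made for illustration; this is not the paper's evaluation code.

```python
def mask_iou(pred, truth):
    """Intersection-over-union of two equally sized binary masks (lists of rows)."""
    if len(pred) != len(truth) or any(len(p) != len(t) for p, t in zip(pred, truth)):
        raise ValueError("masks must have the same shape")
    inter = 0
    union = 0
    for p_row, t_row in zip(pred, truth):
        for p, t in zip(p_row, t_row):
            inter += 1 if (p and t) else 0   # cell marked in both masks
            union += 1 if (p or t) else 0    # cell marked in at least one mask
    # Assumption: two empty masks agree perfectly, so IoU is defined as 1.0.
    return 1.0 if union == 0 else inter / union


# Hypothetical 2x3 masks: 1 = safe cell, 0 = unsafe cell.
pred = [[1, 1, 0],
        [0, 1, 0]]
truth = [[1, 0, 0],
         [0, 1, 1]]
print(mask_iou(pred, truth))  # 2 shared cells out of 4 marked cells
```

A score of 1.0 means the predicted safe region matches the reference exactly, and 0.0 means the two regions do not overlap at all.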
Abstract
Autonomous drone delivery systems are rapidly advancing, but ensuring safe and reliable package drop-offs remains highly challenging in cluttered urban and suburban environments where accurately identifying suitable package drop zones is critical. Existing approaches typically rely on either geometry-based analysis or semantic segmentation alone, but these methods lack the integrated semantic reasoning required for robust decision-making. To address this gap, we propose See&Say, a novel framework that combines geometric safety cues with semantic perception, guided by a Vision-Language Model (VLM) for iterative refinement. The system fuses monocular depth gradients with open-vocabulary detection masks to produce safety maps, while the VLM dynamically adjusts object category prompts and refines hazard detection across time, enabling reliable reasoning under dynamic conditions during the…
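The fusion step described in the abstract can be illustrated with a rough sketch, under stated assumptions: a cell counts as safe when the local depth gradient is small (a roughly flat surface) and no detected hazard mask covers it. The function names, the `max_gradient` threshold, and the logical-AND fusion rule are all hypothetical choices for illustration; the abstract does not specify how See&Say combines the two cues, and the VLM prompt-refinement loop is not modeled here.

```python
def depth_gradient_magnitude(depth):
    """Finite-difference gradient magnitude of a 2D depth map (list of rows)."""
    h, w = len(depth), len(depth[0])
    grad = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Central differences in the interior, one-sided at the borders.
            x0, x1 = max(x - 1, 0), min(x + 1, w - 1)
            y0, y1 = max(y - 1, 0), min(y + 1, h - 1)
            dx = (depth[y][x1] - depth[y][x0]) / max(x1 - x0, 1)
            dy = (depth[y1][x] - depth[y0][x]) / max(y1 - y0, 1)
            grad[y][x] = (dx * dx + dy * dy) ** 0.5
    return grad


def safety_map(depth, hazard_mask, max_gradient=0.1):
    """Return 1 for cells that are flat AND hazard-free, else 0.

    max_gradient is a hypothetical flatness threshold, not a value from the paper.
    """
    grad = depth_gradient_magnitude(depth)
    return [
        [1 if (g <= max_gradient and not m) else 0 for g, m in zip(g_row, m_row)]
        for g_row, m_row in zip(grad, hazard_mask)
    ]


# Toy scene: flat ground at depth 5.0 with a raised edge in the last column,
# plus one cell flagged by a (hypothetical) open-vocabulary detector.
depth = [[5.0, 5.0, 5.0, 3.0],
         [5.0, 5.0, 5.0, 3.0],
         [5.0, 5.0, 5.0, 3.0]]
hazard = [[1, 0, 0, 0],
          [0, 0, 0, 0],
          [0, 0, 0, 0]]
for row in safety_map(depth, hazard):
    print(row)
```

In this toy scene the two right-hand columns are rejected by the geometric cue alone, while the top-left cell is rejected only by the semantic cue, which is the kind of complementary behavior the abstract attributes to combining the two.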
