DroneVLA: VLA based Aerial Manipulation
Fawad Mehboob, Monijesu James, Amir Habel, Jeffrin Sam, Miguel Altamirano Cabrera, Dzmitry Tsetserukou

TL;DR
DroneVLA presents an autonomous aerial manipulation system that interprets natural language commands to locate, navigate, grasp, and hand over objects using advanced vision-language reasoning and safe human interaction techniques.
Contribution
This work introduces a novel integration of vision-language reasoning, semantic navigation, and human-centric control for drone-based object manipulation from natural language commands.
Findings
Achieved precise localization with a maximum error of 0.164 m
Demonstrated effective natural language understanding for task prioritization
Validated safe human-drone interaction during object handover
Abstract
As aerial platforms evolve from passive observers to active manipulators, the challenge shifts toward designing intuitive interfaces that allow non-expert users to command these systems naturally. This work introduces a novel concept of an autonomous aerial manipulation system capable of interpreting high-level natural language commands to retrieve objects and deliver them to a human user. The system is intended to integrate a MediaPipe based on Grounding DINO and a Vision-Language-Action (VLA) model with a custom-built drone equipped with a 1-DOF gripper and an Intel RealSense RGB-D camera. The VLA model performs semantic reasoning to interpret the intent of a user prompt and generates a prioritized task queue for grasping relevant objects in the scene. Grounding DINO and a dynamic A* planning algorithm are used to navigate and safely relocate the object. To ensure safe and natural interaction…
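The abstract names a dynamic A* planner but does not describe it. As a rough illustration of the underlying search only, here is a minimal sketch of plain (static) A* on a 2D occupancy grid. The grid representation, the 4-connected unit-cost moves, the Manhattan heuristic, and the function name `astar` are all assumptions made for this example, not details taken from the paper.

```python
import heapq


def astar(grid, start, goal):
    """Plain A* on a 4-connected occupancy grid (hypothetical setup).

    grid  -- list of rows, where 0 is a free cell and 1 is an obstacle
    start -- (row, col) of the start cell
    goal  -- (row, col) of the goal cell
    Returns the list of cells from start to goal, or None if unreachable.
    """
    rows, cols = len(grid), len(grid[0])

    def heuristic(cell):
        # Manhattan distance: admissible for unit-cost 4-connected moves.
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    # Heap entries are (estimated total cost, cost so far, cell).
    open_heap = [(heuristic(start), 0, start)]
    came_from = {}
    best_cost = {start: 0}

    while open_heap:
        _, cost, cell = heapq.heappop(open_heap)
        if cell == goal:
            # Walk the parent links back to the start.
            path = [cell]
            while cell in came_from:
                cell = came_from[cell]
                path.append(cell)
            return path[::-1]
        if cost > best_cost.get(cell, float("inf")):
            continue  # stale entry: a cheaper route to this cell was found
        r, c = cell
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nxt
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0:
                new_cost = cost + 1
                if new_cost < best_cost.get(nxt, float("inf")):
                    best_cost[nxt] = new_cost
                    came_from[nxt] = cell
                    heapq.heappush(
                        open_heap, (new_cost + heuristic(nxt), new_cost, nxt)
                    )
    return None
```

One simple way to approximate dynamic behavior with this sketch would be to call `astar` again from the current cell whenever the occupancy grid changes. Whether the paper's planner replans this way or uses an incremental method is not stated in the abstract.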
Taxonomy
Topics: Multimodal Machine Learning Applications · Robot Manipulation and Learning · Hand Gesture Recognition Systems
