HoverAI: An Embodied Aerial Agent for Natural Human-Drone Interaction

Yuhua Jin, Nikita Kuzmin, Georgii Demianchuk, Mariya Lezina, Fawad Mehboob, Issatay Tokmurziyev, Miguel Altamirano Cabrera, Muhammad Ahsan Mustafa, and Dzmitry Tsetserukou

arXiv:2601.13801 · cs.RO · January 21, 2026

Open Access

TL;DR

HoverAI is a novel embodied aerial drone platform that combines mobility, visual projection, and conversational AI to enable natural, socially aware human-drone interactions in shared spaces.

Contribution

The paper introduces an integrated system that pairs multimodal perception (vision and voice) with adaptive responses delivered through projected, lip-synced avatars and real-time dialogue.

Findings

01 High command recognition accuracy (F1: 0.90)

02 Effective demographic estimation (gender F1: 0.89, age MAE: 5.14 years)

03 Accurate speech transcription (WER: 0.181)
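For readers unfamiliar with the three metrics reported in the findings, the sketch below computes each one from its standard textbook definition. It is illustrative only: this page does not describe the paper's evaluation protocol (for example, whether F1 is binary or macro-averaged, or how transcripts are normalized before scoring), so the function names, the binary form of F1, and the whitespace tokenization are all assumptions.

```python
from typing import Sequence


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumption: words are split on whitespace with no text normalization.
    """
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Binary F1 for the positive class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    return 2 * tp / (2 * tp + fp + fn)


def mean_absolute_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Average absolute difference, e.g. between true and estimated age in years."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


if __name__ == "__main__":
    # Made-up sample data, not taken from the paper.
    print(word_error_rate("land on the table now", "land on table"))
    print(f1_score([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]))
    print(mean_absolute_error([30, 42, 25], [34, 40, 31]))
```

Under these definitions, lower is better for WER and MAE, while higher is better for F1. A WER of 0.181 corresponds to roughly 18 word-level errors per 100 reference words.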

Abstract

Drones operating in human-occupied spaces suffer from insufficient communication mechanisms that create uncertainty about their intentions. We present HoverAI, an embodied aerial agent that integrates drone mobility, infrastructure-independent visual projection, and real-time conversational AI into a unified platform. Equipped with a MEMS laser projector, onboard semi-rigid screen, and RGB camera, HoverAI perceives users through vision and voice, responding via lip-synced avatars that adapt appearance to user demographics. The system employs a multimodal pipeline combining VAD, ASR (Whisper), LLM-based intent classification, RAG for dialogue, face analysis for personalization, and voice synthesis (XTTS v2). Evaluation demonstrates high accuracy in command recognition (F1: 0.90), demographic estimation (gender F1: 0.89, age MAE: 5.14 years), and speech transcription (WER: 0.181). By…

Taxonomy

Topics: Face recognition and analysis · Social Robot Interaction and HRI · Multimodal Machine Learning Applications