HoverAI: An Embodied Aerial Agent for Natural Human-Drone Interaction
Yuhua Jin, Nikita Kuzmin, Georgii Demianchuk, Mariya Lezina, Fawad Mehboob, Issatay Tokmurziyev, Miguel Altamirano Cabrera, Muhammad Ahsan Mustafa, and Dzmitry Tsetserukou

TL;DR
HoverAI is a novel embodied aerial drone platform that combines mobility, visual projection, and conversational AI to enable natural, socially aware human-drone interactions in shared spaces.
Contribution
The paper introduces an integrated system that pairs multimodal perception (vision and voice) with adaptive visual and conversational responses for human-drone interaction.
Findings
High command recognition accuracy (F1: 0.90)
Effective demographic estimation (gender F1: 0.89, age MAE: 5.14 years)
Accurate speech transcription (WER: 0.181)
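For readers unfamiliar with the three metrics reported above, the following is a minimal sketch of how WER, MAE, and F1 are conventionally computed. It is not the authors' evaluation code: the function names are illustrative, the F1 shown is the plain single-class form (the summary does not state which averaging the paper used), and the sample inputs are invented for demonstration only.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if r == h else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def mean_absolute_error(y_true, y_pred) -> float:
    """MAE = average absolute difference, e.g. predicted vs. true age in years."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_score(y_true, y_pred, positive) -> float:
    """F1 for one class: harmonic mean of precision and recall."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Invented example: one dropped word out of four reference words.
wer = word_error_rate("take off and hover", "take off hover")
```

Read this way, a WER of 0.181 means roughly 18 word-level errors per 100 reference words, and an age MAE of 5.14 means the estimate is off by about five years on average.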
Abstract
Drones operating in human-occupied spaces suffer from insufficient communication mechanisms that create uncertainty about their intentions. We present HoverAI, an embodied aerial agent that integrates drone mobility, infrastructure-independent visual projection, and real-time conversational AI into a unified platform. Equipped with a MEMS laser projector, onboard semi-rigid screen, and RGB camera, HoverAI perceives users through vision and voice, responding via lip-synced avatars that adapt appearance to user demographics. The system employs a multimodal pipeline combining VAD, ASR (Whisper), LLM-based intent classification, RAG for dialogue, face analysis for personalization, and voice synthesis (XTTS v2). Evaluation demonstrates high accuracy in command recognition (F1: 0.90), demographic estimation (gender F1: 0.89, age MAE: 5.14 years), and speech transcription (WER: 0.181). By…
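The stage ordering named in the abstract (VAD, ASR, intent classification, RAG dialogue, face analysis, voice synthesis) can be sketched as a chain of pluggable callables. This is only an illustration under stated assumptions, not the authors' implementation: every name below (`Pipeline`, `Turn`, `handle`, the `"dialogue"` intent label) is hypothetical, the routing of non-dialogue intents is assumed, and no real Whisper or XTTS v2 API is called; each stage is a stub to be supplied by the caller.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Turn:
    """Result of processing one user utterance (hypothetical structure)."""
    transcript: str
    intent: str
    avatar: str
    reply_text: Optional[str] = None
    speech: Optional[bytes] = None


@dataclass
class Pipeline:
    """Hypothetical wiring of the stages; each field is a caller-supplied stub."""
    has_speech: Callable[[bytes], bool]      # VAD stage
    transcribe: Callable[[bytes], str]       # ASR stage (Whisper in the paper)
    classify_intent: Callable[[str], str]    # LLM-based intent classification
    answer: Callable[[str], str]             # RAG-backed dialogue
    choose_avatar: Callable[[bytes], str]    # face analysis for personalization
    synthesize: Callable[[str], bytes]       # voice synthesis (XTTS v2 in the paper)

    def handle(self, audio: bytes, frame: bytes) -> Optional[Turn]:
        # Skip all downstream work when the VAD stage detects no speech.
        if not self.has_speech(audio):
            return None
        transcript = self.transcribe(audio)
        intent = self.classify_intent(transcript)
        turn = Turn(transcript, intent, self.choose_avatar(frame))
        # Assumption: only dialogue intents produce a spoken reply;
        # other intents would be dispatched to the drone elsewhere.
        if intent == "dialogue":
            turn.reply_text = self.answer(transcript)
            turn.speech = self.synthesize(turn.reply_text)
        return turn
```

Keeping each stage behind a plain callable makes it possible to swap a model or test the control flow with stubs, without committing to any particular library interface.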
Taxonomy
Topics: Face recognition and analysis · Social Robot Interaction and HRI · Multimodal Machine Learning Applications
