A Multimodal Transformer Approach for UAV Detection and Aerial Object Recognition Using Radar, Audio, and Video Data
Mauro Larrat, Claudomiro Sales

TL;DR
This paper introduces a multimodal Transformer model that fuses radar, video, infrared, and audio data for UAV detection and aerial object recognition, reaching a macro-averaged accuracy of 0.9812 on the test set at a real-time inference speed of 41.11 FPS.
Contribution
The study presents a novel multimodal Transformer architecture that effectively integrates diverse data streams for improved UAV detection and classification.
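To make the fusion idea concrete, the sketch below treats each modality's feature vector as one token, projects the tokens to a shared width, mixes them with a single self-attention step, and classifies the pooled result. It is an illustration only, not the authors' architecture: the function name `fuse_and_classify`, the sizes `d_model=8` and `n_classes=3`, the feature lengths, and the random (untrained) weights are all assumptions made here.

```python
import math
import random

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def matvec(matrix, vec):
    """Multiply a (rows x cols) matrix by a vector of length cols."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

def random_matrix(rows, cols, rng):
    """Untrained weights, used here only so the sketch runs end to end."""
    scale = 1.0 / math.sqrt(cols)
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

def fuse_and_classify(features, d_model=8, n_classes=3, seed=0):
    """features maps a modality name to its feature vector (lengths may differ)."""
    rng = random.Random(seed)
    # 1. Project every modality into a shared d_model-dimensional token.
    tokens = [matvec(random_matrix(d_model, len(vec), rng), vec)
              for vec in features.values()]
    # 2. Single-head self-attention: every modality token attends to all tokens.
    w_q, w_k, w_v = (random_matrix(d_model, d_model, rng) for _ in range(3))
    queries = [matvec(w_q, t) for t in tokens]
    keys = [matvec(w_k, t) for t in tokens]
    values = [matvec(w_v, t) for t in tokens]
    attended = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_model) for k in keys]
        weights = softmax(scores)
        attended.append([sum(w * v[i] for w, v in zip(weights, values))
                         for i in range(d_model)])
    # 3. Mean-pool the fused tokens and map them to class probabilities.
    pooled = [sum(tok[i] for tok in attended) / len(attended) for i in range(d_model)]
    return softmax(matvec(random_matrix(n_classes, d_model, rng), pooled))

# Hypothetical per-modality features; real inputs would come from trained encoders.
example = {
    "radar": [0.2, 0.7, 0.1],
    "rgb": [0.5, 0.4, 0.9, 0.3],
    "ir": [0.6, 0.1],
    "audio": [0.3, 0.8, 0.2, 0.4, 0.5],
}
probs = fuse_and_classify(example)
```

Because the weights are random, `probs` carries no meaning beyond being a valid probability vector; the point is the data flow from four differently sized inputs to one fused prediction.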
Findings
Achieved macro-averaged accuracy of 0.9812 on test set
Demonstrated high precision and recall in distinguishing drones
Validated real-time inference speed of 41.11 FPS
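As a quick check on what the reported throughput implies, the sketch below converts frames per second into a per-frame latency and compares it with a sensor frame rate. The 30 FPS source rate is an assumption chosen for illustration; the summary does not state the input frame rate.

```python
def frame_latency_ms(fps):
    """Average time spent per frame, in milliseconds, at a given throughput."""
    return 1000.0 / fps

def meets_real_time(inference_fps, source_fps):
    """True when inference keeps up with the rate at which frames arrive."""
    return inference_fps >= source_fps

latency = frame_latency_ms(41.11)          # about 24.3 ms per frame
keeps_up = meets_real_time(41.11, 30.0)    # 30 FPS source rate is assumed
```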
Abstract
Unmanned aerial vehicle (UAV) detection and aerial object recognition are critical for modern surveillance and security, prompting a need for robust systems that overcome limitations of single-modality approaches. This research addresses these challenges by designing and rigorously evaluating a novel multimodal Transformer model that integrates diverse data streams: radar, visual band video (RGB), infrared (IR) video, and audio. The architecture effectively fuses distinct features from each modality, leveraging the Transformer's self-attention mechanisms to learn comprehensive, complementary, and highly discriminative representations for classification. The model demonstrated exceptional performance on an independent test set, achieving macro-averaged metrics of 0.9812 accuracy, 0.9873 recall, 0.9787 precision, 0.9826 F1-score, and 0.9954 specificity. Notably, it exhibited particularly…
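The macro-averaged metrics listed above can be computed from a multiclass confusion matrix by scoring each class one-vs-rest and averaging the per-class values with equal weight. The sketch below assumes that definition (the summary does not spell out how the averages were taken), and the 3-class matrix is invented toy data, not the paper's results.

```python
def macro_metrics(confusion):
    """confusion[i][j] counts samples of true class i predicted as class j."""
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    per_class = {"accuracy": [], "recall": [], "precision": [],
                 "f1": [], "specificity": []}
    for k in range(n):
        # One-vs-rest counts for class k.
        tp = confusion[k][k]
        fn = sum(confusion[k]) - tp
        fp = sum(confusion[i][k] for i in range(n)) - tp
        tn = total - tp - fn - fp
        recall = tp / (tp + fn) if tp + fn else 0.0
        precision = tp / (tp + fp) if tp + fp else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        per_class["accuracy"].append((tp + tn) / total if total else 0.0)
        per_class["recall"].append(recall)
        per_class["precision"].append(precision)
        per_class["f1"].append(f1)
        per_class["specificity"].append(tn / (tn + fp) if tn + fp else 0.0)
    # Macro average: unweighted mean over classes.
    return {name: sum(vals) / n for name, vals in per_class.items()}

# Toy confusion matrix (hypothetical counts for three classes).
toy = [
    [8, 1, 1],
    [0, 9, 1],
    [1, 0, 9],
]
metrics = macro_metrics(toy)
```

Macro averaging gives every class the same weight regardless of how many samples it has, which is why it is a common choice when some aerial object classes are much rarer than others.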
Taxonomy
Topics: UAV Applications and Optimization · Advanced SAR Imaging Techniques · Fire Detection and Safety Systems
