Tri-Modal Fusion Transformers for UAV-based Object Detection
Craig Iaboni, Pramod Abichandani

TL;DR
This paper introduces a tri-modal fusion framework using transformers for UAV object detection, integrating RGB, thermal, and event data, and provides a new dataset and systematic benchmark.
Contribution
The paper presents the first systematic study of tri-modal UAV object detection, together with a modular fusion framework, a new dataset, and an extensive ablation analysis.
Findings
Tri-modal fusion outperforms dual-modal baselines across various configurations.
Fusion depth, meaning the encoder stage at which the modalities are fused, significantly affects detection performance.
A lightweight CSSA variant maintains most benefits with minimal computational cost.
Abstract
Reliable UAV object detection requires robustness to illumination changes, motion blur, and scene dynamics that suppress RGB cues. Thermal long-wave infrared (LWIR) sensing preserves contrast in low light, and event cameras retain microsecond-level temporal edges, but integrating all three modalities in a unified detector has not been systematically studied. We present a tri-modal framework that processes RGB, thermal, and event data with a dual-stream hierarchical vision transformer. At selected encoder depths, a Modality-Aware Gated Exchange (MAGE) applies inter-sensor channel and spatial gating, and a Bidirectional Token Exchange (BiTE) module performs bidirectional token-level attention with depthwise-pointwise refinement, producing resolution-preserving fused maps for a standard feature pyramid and two-stage detector. We introduce a 10,489-frame UAV dataset with synchronized and…
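To make the idea of inter-sensor channel and spatial gating concrete, the following is a minimal NumPy sketch of one generic way such a gate could work. It is not the paper's MAGE module: the function name `cross_modal_gate`, the use of plain averages in place of learned projections, and the residual form are all assumptions chosen for illustration. The only property taken from the abstract is that the output keeps the input resolution.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cross_modal_gate(target, source):
    """Hypothetical gate: modulate `target` (C, H, W) with statistics of `source` (C, H, W)."""
    # Channel gate: one weight per channel, from the source's global average pool.
    channel_gate = sigmoid(source.mean(axis=(1, 2)))[:, None, None]  # shape (C, 1, 1)
    # Spatial gate: one weight per location, from the source's mean over channels.
    spatial_gate = sigmoid(source.mean(axis=0))[None, :, :]          # shape (1, H, W)
    # Residual form (an assumption): keeps the original signal and the resolution.
    return target + target * channel_gate * spatial_gate

# Illustrative random feature maps standing in for two sensor streams.
rng = np.random.default_rng(0)
rgb = rng.standard_normal((8, 16, 16))
thermal = rng.standard_normal((8, 16, 16))

fused_rgb = cross_modal_gate(rgb, thermal)      # RGB features gated by thermal
fused_thermal = cross_modal_gate(thermal, rgb)  # thermal features gated by RGB
```

A real module would replace the plain averages with learned layers. The sketch only shows how one sensor's statistics can reweight another sensor's channels and locations without changing the feature-map shape.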
