Leveraging YOLO-World and GPT-4V LMMs for Zero-Shot Person Detection and Action Recognition in Drone Imagery
Christian Limberg, Artur Gonçalves, Bastien Rigault, Helmut Prendinger

TL;DR
This paper investigates the use of zero-shot Large Multimodal Models, YOLO-World and GPT-4V, for person detection and action recognition in drone imagery, highlighting their potential and limitations in aerial perception tasks.
Contribution
It is the first study to evaluate LMMs like YOLO-World and GPT-4V for drone-based perception, demonstrating their capabilities and challenges in zero-shot person detection and scene understanding.
Findings
YOLO-World shows good detection performance.
GPT-4V effectively filters region proposals and describes scenes.
GPT-4V has limited accuracy in classifying actions.
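The two-stage idea behind the first two findings (a detector proposes person regions, a second model filters them) can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' implementation: `Proposal`, `filter_proposals`, the `verify` callback, and the `min_score` threshold are all hypothetical names and values. In a real pipeline the proposals might come from an open-vocabulary detector such as YOLO-World, and `verify` might wrap a call to a multimodal model such as GPT-4V; here both are replaced by plain Python stand-ins.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

# (x_min, y_min, x_max, y_max) in pixels; the layout is an assumption.
Box = Tuple[int, int, int, int]


@dataclass(frozen=True)
class Proposal:
    """One candidate person region with the detector's confidence score."""
    box: Box
    score: float


def filter_proposals(
    proposals: Sequence[Proposal],
    verify: Callable[[Box], bool],
    min_score: float = 0.1,  # hypothetical threshold, not taken from the paper
) -> List[Proposal]:
    """Keep proposals that pass a cheap score check and then a verifier.

    `verify` stands in for the second-stage model that decides whether a
    region really contains a person. The score check runs first so the
    (presumably expensive) verifier is only called on plausible regions.
    """
    kept: List[Proposal] = []
    for proposal in proposals:
        if proposal.score < min_score:
            continue  # drop low-confidence regions without calling the verifier
        if verify(proposal.box):
            kept.append(proposal)
    return kept


def box_area(box: Box) -> int:
    """Area of a box in square pixels (0 for degenerate boxes)."""
    x_min, y_min, x_max, y_max = box
    return max(0, x_max - x_min) * max(0, y_max - y_min)


# Stand-in data and a toy verifier that accepts only sufficiently large regions.
example_proposals = [
    Proposal(box=(0, 0, 10, 10), score=0.90),
    Proposal(box=(0, 0, 5, 5), score=0.05),
    Proposal(box=(20, 20, 40, 60), score=0.40),
]
example_kept = filter_proposals(
    example_proposals,
    verify=lambda box: box_area(box) >= 200,
)
```

Passing the verifier in as a callback keeps the filtering logic independent of any particular model or API, so the second stage can be swapped or mocked in tests.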
Abstract
In this article, we explore the potential of zero-shot Large Multimodal Models (LMMs) in the domain of drone perception. We focus on person detection and action recognition tasks and evaluate two prominent LMMs, namely YOLO-World and GPT-4V(ision), using a publicly available dataset captured from aerial views. Traditional deep learning approaches rely heavily on large and high-quality training datasets. However, in certain robotic settings, acquiring such datasets can be resource-intensive or impractical within a reasonable timeframe. The flexibility of prompt-based LMMs and their exceptional generalization capabilities have the potential to revolutionize robotics applications in these scenarios. Our findings suggest that YOLO-World demonstrates good detection performance. GPT-4V struggles with accurately classifying action classes but delivers promising results…
Taxonomy
Topics: Video Surveillance and Tracking Methods · Advanced Neural Network Applications · Fire Detection and Safety Systems
