Great Ape Detection in Challenging Jungle Camera Trap Footage via Attention-Based Spatial and Temporal Feature Blending
Xinyu Yang, Majid Mirmehdi, Tilo Burghardt

TL;DR
This paper introduces a novel multi-frame video detection framework with attention-based feature blending for identifying great apes in challenging jungle camera trap footage, significantly improving detection robustness.
Contribution
The paper presents the first multi-frame video object detection framework trained to detect great apes, extending a feature pyramid architecture with self-attention driven feature blending in both the spatial and temporal domains.
Findings
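Significantly outperforms frame-based detectors on clips with partial occlusion, challenging lighting, dynamic backgrounds, and natural camouflage
Performs robustly on real-world jungle camera trap footage despite significant partial occlusion
Evaluated on 500 camera trap videos (180K frames) from the Pan African Programme, manually annotated with per-frame bounding boxes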
Abstract
We propose the first multi-frame video object detection framework trained to detect great apes. It is applicable to challenging camera trap footage in complex jungle environments and extends a traditional feature pyramid architecture by adding self-attention driven feature blending in both the spatial as well as the temporal domain. We demonstrate that this extension can detect distinctive species appearance and motion signatures despite significant partial occlusion. We evaluate the framework using 500 camera trap videos of great apes from the Pan African Programme containing 180K frames, which we manually annotated with accurate per-frame animal bounding boxes. These clips contain significant partial occlusions, challenging lighting, dynamic backgrounds, and natural camouflage effects. We show that our approach performs highly robustly and significantly outperforms frame-based…
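To make the idea of attention-driven temporal blending more concrete, here is a minimal sketch in plain Python. It is not the authors' architecture: it assumes a single spatial location, uses the raw feature vectors as queries, keys, and values with no learned projections, and the names `softmax` and `blend_temporal` are hypothetical. It only illustrates how features from neighbouring frames can be weighted by their similarity to a reference frame and summed.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def blend_temporal(frame_features, ref_index):
    """Blend feature vectors from T frames into one vector for the reference frame.

    frame_features: list of T vectors (lists of C floats), all taken at the
        same spatial location in T neighbouring frames (an assumption of
        this sketch, not a detail from the paper).
    ref_index: index of the frame whose detection is being refined.
    Returns (blended_vector, attention_weights).
    """
    query = frame_features[ref_index]
    scale = math.sqrt(len(query))
    # Scaled dot-product similarity between the reference frame and each frame.
    scores = [
        sum(q * k for q, k in zip(query, key)) / scale
        for key in frame_features
    ]
    weights = softmax(scores)
    # Attention-weighted sum of the per-frame features, channel by channel.
    blended = [
        sum(w * feat[c] for w, feat in zip(weights, frame_features))
        for c in range(len(query))
    ]
    return blended, weights


# Toy input: frames 0 and 1 agree, frame 2 differs (e.g. an occluded view).
features = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
blended, weights = blend_temporal(features, ref_index=0)
```

In this toy input the two agreeing frames receive equal weights that are larger than the weight of the differing frame, so the blended vector leans toward the consistent evidence. A full detector would apply learned projections and operate on whole feature maps rather than a single location.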
Taxonomy
Topics: Video Surveillance and Tracking Methods · Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning
