Self-supervised Learning on Camera Trap Footage Yields a Strong Universal Face Embedder
Vladimir Iashin, Horace Lee, Dan Schofield, Andrew Zisserman

TL;DR
This paper presents a self-supervised learning method using Vision Transformers to create a universal chimpanzee face embedder from unlabeled camera trap footage, outperforming supervised methods in re-identification tasks.
Contribution
It introduces a novel self-supervised approach for learning face embeddings from unlabeled wildlife footage, eliminating the need for manual labels and improving re-identification performance.
Findings
Outperforms supervised baselines on re-identification benchmarks
Uses only unlabeled camera trap footage for training
Demonstrates scalability for biodiversity monitoring
Abstract
Camera traps are revolutionising wildlife monitoring by capturing vast amounts of visual data; however, the manual identification of individual animals remains a significant bottleneck. This study introduces a fully self-supervised approach to learning robust chimpanzee face embeddings from unlabeled camera-trap footage. Leveraging the DINOv2 framework, we train Vision Transformers on automatically mined face crops, eliminating the need for identity labels. Our method demonstrates strong open-set re-identification performance, surpassing supervised baselines on challenging benchmarks such as Bossou, despite utilising no labelled data during training. This work underscores the potential of self-supervised learning in biodiversity monitoring and paves the way for scalable, non-invasive population studies.
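To make "open-set re-identification" concrete, here is a minimal sketch of how face embeddings might be matched against a gallery of known individuals, with queries that match nobody well enough flagged as unknown. It is not the authors' evaluation protocol, which this summary does not describe. The function names, the cosine-similarity matching rule, the 0.5 threshold, the toy 3-dimensional vectors, and the identity labels "A" and "B" are all illustrative assumptions.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    x = np.asarray(x, dtype=np.float64)
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)

def open_set_identify(query, gallery, gallery_ids, threshold=0.5):
    """Assign each query embedding to its nearest gallery identity,
    or to None (unknown individual) if the best similarity is below threshold.

    Assumption: the threshold value is hypothetical and would need tuning.
    """
    q = l2_normalize(query)
    g = l2_normalize(gallery)
    sims = q @ g.T                      # (num_queries, num_gallery) cosine similarities
    best = sims.argmax(axis=1)          # index of the closest gallery entry per query
    best_sim = sims[np.arange(len(q)), best]
    labels = [gallery_ids[j] if s >= threshold else None
              for j, s in zip(best, best_sim)]
    return labels, best_sim

# Toy embeddings standing in for the output of a face embedder (hypothetical values).
gallery = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
gallery_ids = ["A", "B"]
queries = [[0.9, 0.1, 0.0],   # close to individual "A"
           [0.0, 0.0, 1.0]]   # unlike anyone in the gallery

labels, scores = open_set_identify(queries, gallery, gallery_ids)
print(labels)
```

In this toy case the first query is matched to "A" and the second is rejected as unknown. In practice the embeddings would come from the trained Vision Transformer, and the threshold would be chosen on held-out data.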
Taxonomy
Topics: Face recognition and analysis · Video Surveillance and Tracking Methods · Face and Expression Recognition
