Multimodal Foundation Models for Zero-shot Animal Species Recognition in Camera Trap Images

Zalan Fabian; Zhongqi Miao; Chunyuan Li; Yuanhan Zhang; Ziwei Liu; Andrés Hernández; Andrés Montes-Rojas; Rafael Escucha; Laura Siabatto; Andrés Link; Pablo Arbeláez; Rahul Dodhia; Juan Lavista Ferres

arXiv:2311.01064 · cs.CV · November 3, 2023 · 6 cites

Open Access

TL;DR

WildMatch is a zero-shot wildlife species recognition framework that uses multimodal foundation models and instruction tuning to identify animals in camera trap images without requiring labeled training data.

Contribution

The paper introduces WildMatch, a novel zero-shot classification method leveraging multimodal models and knowledge augmentation for wildlife monitoring.

Findings

01. Effective zero-shot species recognition demonstrated on Colombian camera trap data
02. Instruction tuning improves detailed animal description generation
03. Knowledge augmentation enhances caption quality and classification accuracy

Abstract

Due to deteriorating environmental conditions and increasing human activity, conservation efforts directed towards wildlife are crucial. Motion-activated camera traps constitute an efficient tool for tracking and monitoring wildlife populations across the globe. Supervised learning techniques have been successfully deployed to analyze such imagery; however, training such models requires annotations from experts. Reducing the reliance on costly labelled data therefore has immense potential for developing large-scale wildlife tracking solutions with markedly less human labor. In this work, we propose WildMatch, a novel zero-shot species classification framework that leverages multimodal foundation models. In particular, we instruction-tune vision-language models to generate detailed visual descriptions of camera trap images using terminology similar to that of experts. Then, we match the…

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization

Methods: Balanced Selection