Fine-Tuning Video-Text Contrastive Model for Primate Behavior Retrieval from Unlabeled Raw Videos
Giulio Cesare Mastrocinque Santo, Patrícia Izar, Irene Delval, Victor de Napole Gregolin, Nina S. T. Hirata

TL;DR
This paper presents a method for fine-tuning video-text models to retrieve primate behavior clips from unlabeled videos using weak audio descriptions, significantly improving retrieval accuracy in a challenging domain.
Contribution
It introduces a novel data processing pipeline and fine-tuning approach for domain-specific primate behavior retrieval from raw videos without labeled data.
Findings
Significant improvement in Hits@5 metrics (167% and 114%)
Model effectively ranks behaviors using NDCG@K
Raw pre-trained models perform poorly on domain data
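For readers unfamiliar with the two metrics named in the findings, the following is a minimal Python sketch of how Hits@K and NDCG@K are commonly computed for a single query. It is not the authors' evaluation code. The function names, the list-of-relevance-scores input format, and the choice to compute the ideal ranking from the retrieved list alone are all assumptions made for illustration.

```python
import math

def hits_at_k(ranked_relevance, k):
    """Return 1.0 if at least one of the top-k results is relevant, else 0.0.

    ranked_relevance: relevance scores in ranked order (assumed format;
    any score > 0 counts as relevant).
    """
    return 1.0 if any(rel > 0 for rel in ranked_relevance[:k]) else 0.0

def dcg_at_k(ranked_relevance, k):
    """Discounted cumulative gain: relevance discounted by log2 of rank + 1."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(ranked_relevance[:k]))

def ndcg_at_k(ranked_relevance, k):
    """DCG normalized by the DCG of the best possible ordering.

    Assumption: the ideal ordering is built from the same list, i.e. the
    list is taken to contain every candidate clip for the query.
    """
    ideal = dcg_at_k(sorted(ranked_relevance, reverse=True), k)
    return dcg_at_k(ranked_relevance, k) / ideal if ideal > 0 else 0.0

# Hypothetical query: the only relevant clip is ranked third.
example = [0, 0, 1, 0, 0, 0]
print(hits_at_k(example, 5), ndcg_at_k(example, 5))
```

Hits@K only records whether a relevant clip appears in the top K, while NDCG@K also rewards placing relevant clips nearer the top, which is why the two are usually reported together.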
Abstract
Video recordings of nonhuman primates in their natural habitat are a common source for studying their behavior in the wild. We fine-tune pre-trained video-text foundational models for the specific domain of capuchin monkeys, with the goal of developing useful computational models to help researchers retrieve useful clips from videos. We focus on the challenging problem of training a model based solely on raw, unlabeled video footage, using weak audio descriptions sometimes provided by field collaborators. We leverage recent advances in Multimodal Large Language Models (MLLMs) and Vision-Language Models (VLMs) to address the extremely noisy nature of both video and audio content. Specifically, we propose a two-fold approach: an agentic data treatment pipeline and a fine-tuning process. The data processing pipeline automatically extracts clean and semantically aligned video-text…
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis
