Rodent-Bench
Thomas Heap, Laurence Aitchison, Emma Cahill, Adriana Casado Rodriguez

TL;DR
Rodent-Bench is a new benchmark for evaluating multimodal large language models' ability to annotate complex rodent behavior videos, revealing current models' limitations and guiding future improvements.
Contribution
The paper introduces Rodent-Bench, a comprehensive benchmark with standardized metrics for assessing MLLMs on rodent behavior annotation tasks.
Findings
Current MLLMs perform poorly on detailed behavioral annotation.
Models show modest success in grooming detection but struggle with temporal segmentation.
Significant challenges remain in handling extended videos and subtle behaviors.
Abstract
We present Rodent-Bench, a novel benchmark designed to evaluate the ability of Multimodal Large Language Models (MLLMs) to annotate rodent behaviour footage. We evaluate state-of-the-art MLLMs, including Gemini-2.5-Pro, Gemini-2.5-Flash and Qwen-VL-Max, using this benchmark and find that none of these models perform strongly enough to be used as an assistant for this task. Our benchmark encompasses diverse datasets spanning multiple behavioral paradigms including social interactions, grooming, scratching, and freezing behaviors, with videos ranging from 10 minutes to 35 minutes in length. We provide two benchmark versions to accommodate varying model capabilities and establish standardized evaluation metrics including second-wise accuracy, macro F1, mean average precision, mutual information, and Matthews correlation coefficient. While some models show modest performance on certain…
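To make two of the metrics named in the abstract concrete, here is a minimal sketch of second-wise accuracy and macro F1 computed from segment-level annotations. It is not the benchmark's own evaluation code: the segment format `(start, end, label)`, the `"other"` background label, and all function names are assumptions made for illustration only.

```python
from typing import List, Tuple

# Hypothetical annotation format: (start_second, end_second_exclusive, label).
Segment = Tuple[int, int, str]


def to_per_second(segments: List[Segment], duration: int,
                  default: str = "other") -> List[str]:
    """Expand segments into one label per second; gaps get the default label."""
    labels = [default] * duration
    for start, end, label in segments:
        for t in range(max(start, 0), min(end, duration)):
            labels[t] = label
    return labels


def second_wise_accuracy(truth: List[str], pred: List[str]) -> float:
    """Fraction of seconds where the predicted label matches the ground truth."""
    if len(truth) != len(pred) or not truth:
        raise ValueError("label sequences must be non-empty and equal in length")
    return sum(t == p for t, p in zip(truth, pred)) / len(truth)


def macro_f1(truth: List[str], pred: List[str]) -> float:
    """Unweighted mean of per-class F1 over every class seen in either sequence."""
    if len(truth) != len(pred) or not truth:
        raise ValueError("label sequences must be non-empty and equal in length")
    scores = []
    for c in sorted(set(truth) | set(pred)):
        tp = sum(t == c and p == c for t, p in zip(truth, pred))
        fp = sum(t != c and p == c for t, p in zip(truth, pred))
        fn = sum(t == c and p != c for t, p in zip(truth, pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy 10-second clip: the model finds only the first half of a grooming bout.
truth = to_per_second([(0, 4, "grooming"), (4, 10, "other")], duration=10)
pred = to_per_second([(0, 2, "grooming")], duration=10)
accuracy = second_wise_accuracy(truth, pred)
f1 = macro_f1(truth, pred)
```

In this toy clip, 8 of 10 seconds match, so the accuracy is 0.8, while macro F1 is lower (16/21, about 0.76) because the rarer grooming class is only half recovered. Macro averaging weights every behavior equally, which is why it penalizes missed rare behaviors more than second-wise accuracy does.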
Taxonomy
Topics: Zebrafish Biomedical Research Applications · EEG and Brain-Computer Interfaces · Neural dynamics and brain function
