GPT-4o: Visual perception performance of multimodal large language models in piglet activity understanding

Yiqi Wu; Xiaodan Hu; Ziming Fu; Siling Zhou; Jiangong Li

arXiv:2406.09781 · cs.CV · June 17, 2024 · 6 cites

Open Access

TL;DR

This study evaluates the visual perception abilities of multimodal large language models in recognizing piglet activities, highlighting GPT-4o's superior performance and suggesting future improvements for animal behavior understanding in livestock videos.

Contribution

The study provides the first comprehensive assessment of multimodal LLMs in animal activity recognition, demonstrating GPT-4o's promising capabilities and identifying areas for enhancement.

Findings

01. GPT-4o outperforms other models in piglet activity recognition

02. Current models need improvement in semantic correspondence and time perception

03. Multimodal LLMs show potential for livestock video understanding

Abstract

Animal ethology is a crucial aspect of animal research, and animal behavior labeling is the foundation for studying animal behavior. This process typically involves labeling video clips with behavioral semantic tags, a task that is complex, subjective, and multimodal. With the rapid development of multimodal large language models (LLMs), new applications have emerged for animal behavior understanding tasks in livestock scenarios. This study evaluates the visual perception capabilities of multimodal LLMs in animal activity recognition. To achieve this, we created piglet test data comprising close-up video clips of individual piglets and annotated full-shot video clips. These data were used to assess the performance of four multimodal LLMs, namely Video-LLaMA, MiniGPT4-Video, Video-Chat2, and GPT-4 omni (GPT-4o), in piglet activity understanding. Through comprehensive evaluation across five…

Taxonomy

Topics: Animal Behavior and Welfare Studies · Animal Vocal Communication and Behavior · Primate Behavior and Ecology

Methods: Attention Is All You Need · Residual Connection · Softmax · Layer Normalization · Byte Pair Encoding · Label Smoothing · Adam · Linear Layer · Multi-Head Attention · Position-Wise Feed-Forward Layer