# Investigating the capabilities of large vision language models in dog emotion recognition

**Authors:** George Martvel, Anna Zamansky, Ilan Shimshoni, Annika Bremhorst

PMC · DOI: 10.1038/s41598-025-25199-7 · Scientific Reports · 2025-11-21

## TL;DR

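This paper evaluates how well large vision-language models (LVLMs) classify dog emotions in a zero-shot setting and finds that their moderate accuracy on web-sourced images is likely driven by superficial cues, such as background context and breed morphology, rather than biologically relevant signals.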

## Contribution

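The study critically evaluates state-of-the-art LVLMs on zero-shot dog emotion classification using two contrasting datasets: the web-sourced Dog Emotions (DE) dataset with layperson-generated labels, and the Labrador Retriever cropped-face (LRc) dataset from a rigorously controlled experiment in which emotional states were systematically elicited.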

## Key findings

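- LVLMs show moderate classification accuracy on the web-sourced DE dataset, but this is likely driven by superficial correlations such as background context and breed morphology.
- Performance drops to near-chance levels on the experimentally controlled LRc dataset, where emotional states are induced and backgrounds are minimal.
- Background manipulation experiments confirm that the models rely heavily on contextual features.
- Prompt variation and system-level instructions slightly improve response rates but do not improve classification accuracy.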

## Abstract

Identifying emotional states in animals is a key challenge in behavioural science and a prerequisite for developing reliable welfare assessments, ethical frameworks, and robust human–animal communication models. Recently, large vision-language models (LVLMs) such as GPT-4o, Gemini, and LLaVA have shown promise in general image understanding tasks, and are beginning to be applied for emotion recognition in animals. In this study, we critically evaluated the ability of state-of-the-art LVLMs to classify emotional states in dogs using a zero-shot approach. We assessed model performance on two datasets: (1) the Dog Emotions (DE) dataset, consisting of web-sourced images with layperson-generated emotion labels, and (2) the Labrador Retriever cropped-face (LRc) dataset, which stems from a rigorously controlled experimental study where emotional states were systematically elicited in dogs and defined based on the experimental context in canine emotion research. Our results revealed that while LVLMs showed moderate classification accuracy on DE, performance is likely driven by superficial correlations, such as background context and breed morphology. When evaluated on LRc, where emotional states are experimentally induced and backgrounds are minimal, performance dropped to near-chance levels, indicating limited ability to generalise based on biologically relevant cues. Background manipulation experiments further confirmed that models relied heavily on contextual features. Prompt variation and system-level instructions slightly improved response rates but did not enhance classification accuracy. These findings highlight significant limitations in the current application of LVLMs to non-human species and raise ethical and epistemological concerns regarding potential anthropocentric biases embedded in their training data. 
We advocate for species-sensitive AI approaches grounded in validated behavioural science, emphasising the need for high-quality, preferably experimentally-based multimodal datasets and more transparent validation. Our study underscores both the potential and the risks of using general-purpose AI to infer internal states in animals and calls for rigorous, interdisciplinary development of animal-centred computational approaches.
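
The abstract separates two quantities that are easy to conflate: how often a model gives a usable answer (response rate) and how often those answers are correct (accuracy), with accuracy then judged against chance. The sketch below illustrates one way to compute them. It is not the authors' evaluation code: the function name `score_zero_shot`, the placeholder labels, the convention that `None` or any off-list string counts as a non-response, and the toy data are all assumptions made for illustration.

```python
from collections import Counter
from typing import Optional, Sequence


def score_zero_shot(
    truth: Sequence[str],
    predicted: Sequence[Optional[str]],
    labels: Sequence[str],
) -> dict:
    """Summarise zero-shot predictions against ground-truth labels.

    A prediction that is None or not in `labels` is treated as a
    non-response (assumed convention, not taken from the paper).
    """
    if not truth or len(truth) != len(predicted):
        raise ValueError("truth and predicted must be non-empty and equal in length")

    # Keep only the items for which the model returned a valid label.
    answered = [(t, p) for t, p in zip(truth, predicted) if p in labels]
    correct = sum(t == p for t, p in answered)

    return {
        # Share of images that received a usable answer.
        "response_rate": len(answered) / len(truth),
        # Accuracy among answered items only.
        "accuracy": correct / len(answered) if answered else 0.0,
        # Expected accuracy of uniform random guessing over the label set.
        "uniform_chance": 1 / len(labels),
        # Accuracy of always predicting the most frequent true label.
        "majority_baseline": max(Counter(truth).values()) / len(truth),
    }


# Hypothetical placeholder labels and toy data, for illustration only.
labels = ["label_a", "label_b", "label_c", "label_d"]
truth = ["label_a", "label_b", "label_c", "label_d"] * 2
predicted = [
    "label_a", "label_b", None, "label_a",
    "label_a", "label_c", "label_c", "unsure",
]

result = score_zero_shot(truth, predicted, labels)
print(result)
```

Reporting the two numbers separately matters for the abstract's point: a change that makes a model answer more often raises the response rate without necessarily moving accuracy away from the chance baseline.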

## Full-text entities

- **Species:** Homo sapiens (human, species) [taxon 9606], Canis lupus familiaris (dog, subspecies) [taxon 9615]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12638295/full.md

## Figures

6 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12638295/full.md

## References

25 references — full list in the complete paper: https://tomesphere.com/paper/PMC12638295/full.md

---
Source: https://tomesphere.com/paper/PMC12638295