Why Do Vision Language Models Struggle To Recognize Human Emotions?

Madhav Agarwal; Sotirios A. Tsaftaris; Laura Sevilla-Lara; Steven McDonagh

arXiv:2604.15280·cs.CV·April 17, 2026

TL;DR

This paper investigates why vision-language models struggle with recognizing human emotions, identifying issues with data bias and temporal information representation, and proposes strategies to improve emotion recognition performance.

Contribution

The paper shows that VLMs are vulnerable in emotion recognition because of data bias and limited temporal coverage, and it introduces novel sampling and context enrichment methods.
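To make the data-bias point concrete, here is a minimal sketch of how collapsing rare classes can hide behind a high overall accuracy. The label counts, the class names, and the "collapsing" predictor are all invented for illustration and are not taken from the paper. The two metrics used, overall accuracy (often called weighted average recall, WAR) and the unweighted mean of per-class recall (UAR), are a common way to evaluate long-tailed classification, but it is an assumption here that they match the paper's protocol.

```python
from collections import Counter


def per_class_recall(y_true, y_pred):
    """Return {class: recall} for every class present in y_true."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {c: hits[c] / totals[c] for c in totals}


def war_uar(y_true, y_pred):
    """Overall accuracy (WAR) and unweighted mean per-class recall (UAR)."""
    recalls = per_class_recall(y_true, y_pred)
    war = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    uar = sum(recalls.values()) / len(recalls)
    return war, uar


# Hypothetical long-tailed toy set: two head classes, two rare classes.
y_true = ["happy"] * 70 + ["sad"] * 20 + ["disgust"] * 6 + ["fear"] * 4

# Hypothetical predictor that is perfect on head classes but maps every
# rare sample onto a head class (the "collapse" described above).
y_pred = ["happy"] * 70 + ["sad"] * 20 + ["sad"] * 6 + ["happy"] * 4

recalls = per_class_recall(y_true, y_pred)
war, uar = war_uar(y_true, y_pred)
print(f"WAR={war:.2f} UAR={uar:.2f} recalls={recalls}")
```

In this toy setting the overall accuracy is 0.90 while the recall on both rare classes is zero, which pulls the unweighted average down to 0.50. Reporting per-class recall alongside overall accuracy is one simple way to surface this kind of failure.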

Findings

01. VLMs tend to collapse rare emotions into common categories due to data bias.

02. Sparse temporal sampling misaligns with micro-expressions, missing critical emotional cues.

03. Multi-stage context enrichment improves emotion recognition by integrating in-between frame information.
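The temporal-sampling finding can be illustrated with a small sketch. It assumes a hypothetical 10-second clip at 30 fps and a hypothetical brief expression lasting 0.2 s (frames 120 to 125); the function names and the centre-of-segment uniform sampler are illustrative choices, not the paper's sampling method. The point is only that a handful of evenly spaced frames can skip a short event entirely, while denser sampling catches it.

```python
def uniform_sample_indices(num_frames, num_samples):
    """Pick one frame index from the centre of each of num_samples equal segments."""
    step = num_frames / num_samples
    return [int(step * (i + 0.5)) for i in range(num_samples)]


def frames_hitting_event(indices, start, end):
    """Return the sampled indices that fall inside the half-open range [start, end)."""
    return [i for i in indices if start <= i < end]


# Hypothetical clip: 10 s at 30 fps.
fps = 30
num_frames = 10 * fps

# Hypothetical brief expression: starts at 4.0 s and lasts 0.2 s.
event_start = int(4.0 * fps)               # frame 120
event_end = event_start + int(0.2 * fps)   # frame 126 (exclusive)

sparse = uniform_sample_indices(num_frames, 8)
dense = uniform_sample_indices(num_frames, 64)

sparse_hits = frames_hitting_event(sparse, event_start, event_end)
dense_hits = frames_hitting_event(dense, event_start, event_end)

print("sparse indices:", sparse)
print("sparse hits:", sparse_hits)  # prints []
print("dense hits:", dense_hits)    # prints [124]
```

With 8 frames the sampler steps 37.5 frames at a time and lands on frames 93 and 131, on either side of the event, so nothing from the expression reaches the model. Whether denser sampling is affordable depends on the model's context budget, which is presumably why the finding above looks at enriching context with in-between frame information rather than simply sampling more frames.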

Abstract

Understanding emotions is a fundamental ability for intelligent systems to be able to interact with humans. Vision-language models (VLMs) have made tremendous progress in the last few years for many visual tasks, potentially offering a promising solution for understanding emotions. However, it is surprising that even the most sophisticated contemporary VLMs struggle to recognize human emotions or to outperform even specialized vision-only classifiers. In this paper we ask the question "Why do VLMs struggle to recognize human emotions?", and observe that the inherently continuous and dynamic task of facial expression recognition (DFER) exposes two critical VLM vulnerabilities. First, emotion datasets are naturally long-tailed, and the web-scale data used to pre-train VLMs exacerbates this head-class bias, causing them to systematically collapse rare, under-represented emotions into…
