Rethinking Facial Expression Recognition in the Era of Multimodal Large Language Models: Benchmark, Datasets, and Beyond

Fan Zhang; Haoxuan Li; Shengju Qian; Xin Wang; Zheng Lian; Hao Wu; Zhihong Zhu; Yuan Gao; Qiankun Li; Yefeng Zheng; Zhouchen Lin; Pheng-Ann Heng

arXiv:2511.00389·cs.CV·November 4, 2025

TL;DR

This paper benchmarks and enhances multimodal large language models for facial expression recognition by converting datasets into VQA format, introducing new datasets and a unified model that improves interpretability and reasoning.

Contribution

It introduces the FERBench benchmark, two new datasets (UniFER-CoT-230K and UniFER-RLVR-360K), and UniFER-7B, a unified FER foundation model with improved reasoning capabilities.

Findings

01. MLLMs show good classification but limited reasoning in FER

02. Proposed post-training strategies improve reasoning in MLLMs

03. UniFER-7B outperforms several existing generalist models

Abstract

Multimodal Large Language Models (MLLMs) have revolutionized numerous research fields, including computer vision and affective computing. As a pivotal challenge in this interdisciplinary domain, facial expression recognition (FER) has evolved from separate, domain-specific models to more unified approaches. One promising avenue to unify FER tasks is converting conventional FER datasets into visual question-answering (VQA) formats, enabling the direct application of powerful generalist MLLMs for inference. However, despite the success of cutting-edge MLLMs in various tasks, their performance on FER tasks remains largely unexplored. To address this gap, we provide FERBench, a systematic benchmark that incorporates 20 state-of-the-art MLLMs across four widely used FER datasets. Our results reveal that, while MLLMs exhibit good classification performance, they still face significant…
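
To make the dataset-to-VQA conversion mentioned in the abstract more concrete, here is a minimal sketch of how a labeled FER image might be wrapped as a question-answer pair. The label set, the prompt wording, and the names `LABELS`, `VQASample`, and `to_vqa` are illustrative assumptions, not the template or code used by the authors.

```python
from dataclasses import dataclass

# Assumed label set: seven basic expression categories often used in FER datasets.
# The datasets in the paper may use different or additional classes.
LABELS = ("anger", "disgust", "fear", "happiness", "sadness", "surprise", "neutral")


@dataclass
class VQASample:
    """One FER example expressed as a visual question-answering item."""
    image_path: str
    question: str
    answer: str


def to_vqa(image_path: str, label: str, labels: tuple = LABELS) -> VQASample:
    """Wrap a (image, label) pair as a closed-set VQA item (hypothetical prompt)."""
    if label not in labels:
        raise ValueError(f"unknown label: {label!r}")
    options = ", ".join(labels)
    question = (
        "What is the facial expression of the person in this image? "
        f"Choose one of: {options}."
    )
    return VQASample(image_path=image_path, question=question, answer=label)


sample = to_vqa("img_001.jpg", "happiness")
print(sample.question)
print(sample.answer)
```

Listing the candidate labels in the question keeps the task closed-set, so a model's free-text reply can be matched against a known class; whether the benchmark prompts are phrased this way is not stated in the text above.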

Taxonomy

Topics: Emotion and Mood Recognition · Multimodal Machine Learning Applications · Face recognition and analysis