Multimodal Prompt Alignment for Facial Expression Recognition

Fuyan Ma; Yiran He; Bin Sun; Shutao Li

arXiv:2506.21017 · cs.CV · June 27, 2025

TL;DR

This paper introduces MPA-FER, a multimodal prompt alignment framework that enhances facial expression recognition by leveraging detailed descriptions from large language models and aligning visual features with class prototypes for better interpretability and accuracy.

Contribution

The paper proposes a novel multimodal prompt alignment method that integrates LLM-generated detailed prompts and prototype-guided feature alignment to improve FER performance.

Findings

01. Outperforms state-of-the-art methods on three FER datasets

02. Maintains pretrained model benefits with minimal extra computation

03. Provides more interpretable facial expression representations

Abstract

Prompt learning has been widely adopted to efficiently adapt vision-language models (VLMs) like CLIP for various downstream tasks. Despite their success, current VLM-based facial expression recognition (FER) methods struggle to capture fine-grained textual-visual relationships, which are essential for distinguishing subtle differences between facial expressions. To address this challenge, we propose a multimodal prompt alignment framework for FER, called MPA-FER, that provides fine-grained semantic guidance to the learning process of prompted visual features, resulting in more precise and interpretable representations. Specifically, we introduce a multi-granularity hard prompt generation strategy that utilizes a large language model (LLM) like ChatGPT to generate detailed descriptions for each facial expression. The LLM-based external knowledge is injected into the soft prompts by…
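
As a rough illustration of the CLIP-style matching the abstract builds on, the sketch below scores one image embedding against several text-description embeddings per expression class. It is not the MPA-FER method: averaging the description embeddings into one class prototype is a generic prompt-ensembling heuristic, and the function names, the temperature of 0.07, the 7-class/3-description/16-dimension sizes, and the random toy features are all assumptions made for this example.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along the given axis."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def class_probabilities(image_feat, description_feats, temperature=0.07):
    """Score an image embedding against per-class text descriptions.

    image_feat:        (D,) image embedding.
    description_feats: (C, K, D) embeddings of K descriptions for each of C classes.
    Returns a (C,) softmax distribution over classes.
    """
    # Assumption: average the K normalized description embeddings into one
    # text prototype per class (generic prompt ensembling, not the paper's method).
    prototypes = l2_normalize(l2_normalize(description_feats).mean(axis=1))  # (C, D)
    image_feat = l2_normalize(image_feat)
    # Cosine similarity between the image and each class prototype, scaled by temperature.
    logits = prototypes @ image_feat / temperature  # (C,)
    logits = logits - logits.max()  # subtract max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()


# Toy data standing in for real encoder outputs (hypothetical sizes).
rng = np.random.default_rng(0)
num_classes, num_descriptions, dim = 7, 3, 16
description_feats = rng.normal(size=(num_classes, num_descriptions, dim))

# Toy image embedding placed close to the class-2 descriptions.
image_feat = l2_normalize(description_feats[2]).mean(axis=0) + 0.01 * rng.normal(size=dim)

probs = class_probabilities(image_feat, description_feats)
predicted = int(np.argmax(probs))
print(predicted, probs.round(3))
```

In a real pipeline the toy arrays would be replaced by features from the VLM's image and text encoders; the similarity-and-softmax step stays the same.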

Taxonomy

Topics: Emotion and Mood Recognition

Methods: Contrastive Language-Image Pre-training · ALIGN