Visual Prompting in LLMs for Enhancing Emotion Recognition

Qixuan Zhang; Zhifeng Wang; Dylan Zhang; Wenjia Niu; Sabrina Caldwell; Tom Gedeon; Yang Liu; Zhenyue Qin

arXiv:2410.02244·cs.CV·October 4, 2024

TL;DR

This paper introduces a novel visual prompting method called Set-of-Vision (SoV) that enhances emotion recognition in vision large language models by utilizing spatial cues like bounding boxes and landmarks, leading to improved accuracy.

Contribution

The paper proposes the SoV approach, which incorporates spatial visual prompts into vision large language models (VLLMs) to address their limitations in spatial localization and global context understanding for emotion recognition.

Findings

01. SoV improves face count accuracy.

02. SoV enhances emotion categorization performance.

03. Spatial prompts significantly boost model understanding of facial expressions.

Abstract

Vision Large Language Models (VLLMs) are transforming the intersection of computer vision and natural language processing. Nonetheless, the potential of using visual prompts for emotion recognition in these models remains largely unexplored and untapped. Traditional methods in VLLMs struggle with spatial localization and often discard valuable global context. To address this problem, we propose a Set-of-Vision prompting (SoV) approach that enhances zero-shot emotion recognition by using spatial information, such as bounding boxes and facial landmarks, to mark targets precisely. SoV improves accuracy in face count and emotion categorization while preserving the enriched image context. Through a battery of experimentation and analysis of recent commercial or open-source VLLMs, we evaluate the SoV model's ability to comprehend facial expressions in natural environments. Our findings…
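The idea of marking targets with bounding boxes and facial landmarks while keeping the whole image can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`draw_box`, `draw_points`, `mark_faces`, `build_prompt`), the colors, the face dictionary layout, and the prompt wording are all assumptions made for illustration, and the image is a plain nested list of RGB tuples so the example runs without any imaging library. Face boxes and landmarks are assumed to come from some external detector.

```python
from typing import Dict, List, Tuple

Pixel = Tuple[int, int, int]
Image = List[List[Pixel]]           # image[y][x] is an RGB tuple
Box = Tuple[int, int, int, int]     # (left, top, right, bottom), inclusive


def draw_box(img: Image, box: Box, color: Pixel) -> None:
    """Draw a 1-pixel rectangle outline in place, clipped to the image."""
    h, w = len(img), len(img[0])
    left, top, right, bottom = box
    # Top and bottom edges.
    for x in range(max(left, 0), min(right, w - 1) + 1):
        for y in (top, bottom):
            if 0 <= y < h:
                img[y][x] = color
    # Left and right edges.
    for y in range(max(top, 0), min(bottom, h - 1) + 1):
        for x in (left, right):
            if 0 <= x < w:
                img[y][x] = color


def draw_points(img: Image, points: List[Tuple[int, int]], color: Pixel) -> None:
    """Mark each (x, y) landmark with a single pixel, skipping out-of-range points."""
    h, w = len(img), len(img[0])
    for x, y in points:
        if 0 <= x < w and 0 <= y < h:
            img[y][x] = color


def mark_faces(
    img: Image,
    faces: List[Dict],
    box_color: Pixel = (255, 0, 0),
    point_color: Pixel = (0, 255, 0),
) -> Image:
    """Return a marked copy of the full image; nothing is cropped away."""
    marked = [row[:] for row in img]
    for face in faces:  # hypothetical layout: {"box": Box, "landmarks": [(x, y), ...]}
        draw_box(marked, face["box"], box_color)
        draw_points(marked, face["landmarks"], point_color)
    return marked


def build_prompt(num_faces: int, labels: List[str]) -> str:
    """Compose a zero-shot text prompt that refers to the visual marks (assumed wording)."""
    return (
        f"The image contains {num_faces} marked face(s). Each face is outlined "
        "by a red box with green landmark points. For each face, ordered left "
        f"to right, choose one emotion from: {', '.join(labels)}."
    )
```

The marked image and the text prompt would then be sent together to a VLLM. Overlaying marks on the full frame, rather than cropping each face, is what keeps the surrounding scene available to the model.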

Taxonomy

Topics: Emotion and Mood Recognition