Visual Prompting in LLMs for Enhancing Emotion Recognition
Qixuan Zhang, Zhifeng Wang, Dylan Zhang, Wenjia Niu, Sabrina Caldwell, Tom Gedeon, Yang Liu, Zhenyue Qin

TL;DR
This paper introduces a novel visual prompting method called Set-of-Vision (SoV) that enhances emotion recognition in vision large language models by utilizing spatial cues like bounding boxes and landmarks, leading to improved accuracy.
Contribution
The paper proposes the SoV approach that incorporates spatial visual prompts into VLLMs, addressing limitations in spatial localization and global context understanding for emotion recognition.
Findings
SoV improves face count accuracy.
SoV enhances emotion categorization performance.
Spatial prompts significantly boost model understanding of facial expressions.
Abstract
Vision Large Language Models (VLLMs) are transforming the intersection of computer vision and natural language processing. Nonetheless, the potential of visual prompts for emotion recognition in these models remains largely unexplored. Traditional methods in VLLMs struggle with spatial localization and often discard valuable global context. To address this problem, we propose a Set-of-Vision prompting (SoV) approach that enhances zero-shot emotion recognition by using spatial information, such as bounding boxes and facial landmarks, to mark targets precisely. SoV improves accuracy in face count and emotion categorization while preserving the enriched image context. Through extensive experiments and analysis of recent commercial and open-source VLLMs, we evaluate the SoV model's ability to comprehend facial expressions in natural environments. Our findings…
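The idea of marking targets with spatial cues while keeping the full image can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, colors, and prompt wording below are hypothetical, the image is a plain nested list of RGB tuples so the example needs no external library, and the face boxes and landmarks are assumed to come from some upstream detector. The sketch draws only box outlines and landmark dots, so every other pixel (the global context) stays unchanged, and it builds a text prompt that refers to each marked face.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]
Image = List[List[Pixel]]          # indexed as image[y][x]
Box = Tuple[int, int, int, int]    # (left, top, right, bottom), inclusive
Point = Tuple[int, int]            # (x, y)
Face = Tuple[Box, List[Point]]     # one box plus its landmarks

# Hypothetical marker colors, chosen only for this illustration.
BOX_COLOR: Pixel = (255, 0, 0)
LANDMARK_COLOR: Pixel = (0, 255, 0)


def draw_box(image: Image, box: Box, color: Pixel) -> None:
    """Draw a one-pixel outline in place; pixels inside the box are untouched."""
    left, top, right, bottom = box
    for x in range(left, right + 1):
        image[top][x] = color
        image[bottom][x] = color
    for y in range(top, bottom + 1):
        image[y][left] = color
        image[y][right] = color


def draw_landmarks(image: Image, points: List[Point], color: Pixel) -> None:
    """Mark each landmark as a single colored pixel."""
    for x, y in points:
        image[y][x] = color


def build_prompt(boxes: List[Box]) -> str:
    """Build a text prompt that refers to each marked face (wording is assumed)."""
    lines = [f"The image contains {len(boxes)} marked face(s)."]
    for index, box in enumerate(boxes, start=1):
        lines.append(f"Face {index}: box {box}.")
    lines.append("For each face, name the emotion it expresses.")
    return "\n".join(lines)


def annotate(image: Image, faces: List[Face]) -> Tuple[Image, str]:
    """Return a marked copy of the image and the matching text prompt."""
    marked = [row[:] for row in image]  # copy rows so the input stays unchanged
    for box, landmarks in faces:
        draw_box(marked, box, BOX_COLOR)
        draw_landmarks(marked, landmarks, LANDMARK_COLOR)
    return marked, build_prompt([box for box, _ in faces])


if __name__ == "__main__":
    black: Pixel = (0, 0, 0)
    blank = [[black] * 8 for _ in range(8)]
    face: Face = ((1, 1, 5, 5), [(2, 2), (4, 2), (3, 4)])
    marked_image, prompt = annotate(blank, [face])
    print(prompt)
```

In a real pipeline the marked image and the prompt would be sent together to a VLLM; that call is omitted here because it depends on the model and API in use.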
Taxonomy
Topics: Emotion and Mood Recognition
