EmoGist: Efficient In-Context Learning for Visual Emotion Understanding
Ronald Seoh, Dan Goldwasser

TL;DR
EmoGist is a training-free in-context learning approach to visual emotion classification. It pairs each test image with context-dependent label descriptions derived from clusters of example images, improving F1 by up to 12 points on Memotion and up to 8 points on FI.
Contribution
We propose EmoGist, a training-free method that generates emotion label descriptions from clusters of example images and retrieves the most relevant descriptions at test time, improving emotion recognition with LVLMs without any additional training.
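The offline step (clustering each label's example images and attaching a description to each cluster) can be pictured with a minimal Python sketch. It assumes every example image has already been encoded as an embedding vector, and it uses plain k-means as the clustering algorithm. The choice of k-means and the names `kmeans`, `describe_cluster`, and `build_label_gists` are illustrative assumptions, not details taken from the paper; `describe_cluster` is a hypothetical stub standing in for the LVLM call that writes a description of each cluster.

```python
from typing import Dict, List, Sequence, Tuple

Vector = List[float]


def _dist2(a: Sequence[float], b: Sequence[float]) -> float:
    # Squared Euclidean distance between two embeddings.
    return sum((x - y) ** 2 for x, y in zip(a, b))


def _mean(points: List[Sequence[float]]) -> Vector:
    # Component-wise mean of a non-empty list of embeddings.
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]


def kmeans(points: List[Sequence[float]], k: int, iters: int = 20) -> List[Vector]:
    """Plain Lloyd's k-means (an assumed stand-in for the paper's clustering).
    Centroids are seeded with the first k points so results are deterministic."""
    k = min(k, len(points))
    centroids = [list(p) for p in points[:k]]
    for _ in range(iters):
        clusters: List[List[Sequence[float]]] = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: _dist2(p, centroids[j]))
            clusters[nearest].append(p)
        # Keep the previous centroid if a cluster ends up empty.
        new = [_mean(c) if c else centroids[j] for j, c in enumerate(clusters)]
        if new == centroids:  # assignments stopped changing
            break
        centroids = new
    return centroids


def describe_cluster(label: str, cluster_id: int) -> str:
    # Hypothetical placeholder: in EmoGist an LVLM writes this description
    # after analyzing the example images in the cluster.
    return f"<description of '{label}' for cluster {cluster_id}>"


def build_label_gists(
    examples: Dict[str, List[Sequence[float]]], k: int
) -> Dict[str, List[Tuple[Vector, str]]]:
    """For each label, cluster its example-image embeddings and attach one
    description per cluster centroid."""
    gists = {}
    for label, embs in examples.items():
        cents = kmeans(embs, k)
        gists[label] = [(c, describe_cluster(label, i)) for i, c in enumerate(cents)]
    return gists


# Toy 2-D "embeddings"; real image embeddings would be much higher-dimensional.
examples = {
    "joy": [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]],
    "fear": [[-3.0, 1.0], [-3.2, 1.1]],
}
gists = build_label_gists(examples, k=2)
for label, versions in gists.items():
    print(label, [c for c, _ in versions])
```

Producing several descriptions per label, one per cluster, is what lets the later retrieval step choose the version of a label's definition that best matches the visual context of a given test image.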
Findings
Up to a 12-point improvement in micro F1 on the multi-label Memotion dataset.
Up to an 8-point improvement in macro F1 on the multi-class FI dataset.
Effective for both multi-label and multi-class emotion classification.
Abstract
In this paper, we introduce EmoGist, a training-free, in-context learning method for performing visual emotion classification with LVLMs. The key intuition of our approach is that context-dependent definitions of emotion labels could allow more accurate predictions of emotions, as the ways in which emotions manifest within images are highly context-dependent and nuanced. EmoGist pre-generates multiple descriptions of each emotion label by analyzing the clusters of example images belonging to that label. At test time, we retrieve a version of the description based on the cosine similarity of the test image to the cluster centroids, and feed it together with the test image to a fast LVLM for classification. Through our experiments, we show that EmoGist yields up to a 12-point improvement in micro F1 on the multi-label Memotion dataset, and up to 8 points in macro F1 on the multi-class FI dataset.
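The test-time retrieval step described in the abstract might look like the following sketch. It assumes one description is selected per emotion label, by taking the cluster whose centroid has the highest cosine similarity to the test image embedding. The toy centroids and descriptions, the names `retrieve_descriptions` and `build_prompt`, and the prompt wording are all hypothetical, and the actual LVLM call is omitted.

```python
import math
from typing import Dict, List, Sequence, Tuple

Gist = Tuple[Sequence[float], str]  # (cluster centroid, label description)


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    # Cosine similarity; returns 0 if either vector has zero length.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def retrieve_descriptions(
    test_emb: Sequence[float], gists: Dict[str, List[Gist]]
) -> Dict[str, str]:
    # For each label, keep the description whose cluster centroid is most
    # cosine-similar to the test image embedding.
    return {
        label: max(versions, key=lambda g: cosine(test_emb, g[0]))[1]
        for label, versions in gists.items()
    }


def build_prompt(descriptions: Dict[str, str]) -> str:
    # Hypothetical prompt format; the retrieved descriptions would be sent
    # together with the test image to an LVLM.
    lines = ["Which of the following emotions does the image convey?"]
    lines += [f"- {label}: {desc}" for label, desc in descriptions.items()]
    return "\n".join(lines)


# Toy centroids and descriptions, invented for illustration.
gists = {
    "joy": [([1.0, 0.0], "smiling faces at a party"), ([0.0, 1.0], "sarcastic meme text")],
    "anger": [([0.7, 0.7], "clenched fists, shouting")],
}
print(build_prompt(retrieve_descriptions([0.9, 0.1], gists)))
```

Because the centroids and descriptions are computed ahead of time, the per-image cost at test time is one embedding, a handful of similarity computations, and a single LVLM call, which is consistent with the method's emphasis on efficiency.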
Taxonomy
Topics: Video Surveillance and Tracking Methods · Human Pose and Action Recognition · Emotion and Mood Recognition
