From Pixels to Feelings: Aligning MLLMs with Human Cognitive Perception of Images
Yiming Chen, Junlin Han, Tianyi Bai, Shengbang Tong, Filippos Kokkinos, Philip Torr

TL;DR
This paper introduces CogIP-Bench, a benchmark for evaluating how well multimodal large language models align with human perceptions of images' subjective qualities, and demonstrates a post-training method to improve this alignment for better human-centric AI applications.
Contribution
The paper presents a new benchmark for subjective image properties and a post-training approach that improves model alignment with human perception, and it shows that this alignment transfers to creative image generation tasks.
Findings
Current models align poorly with human perception of image qualities.
Post-training significantly improves model alignment with human judgments.
Aligned models enhance creative image synthesis with desired traits.
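As a rough illustration of what "alignment with human judgments" could mean in practice, the sketch below compares model-predicted scores with human ratings using Spearman rank correlation. This is an assumption made for illustration: the summary on this page does not state which metric CogIP-Bench uses, and the function names and toy numbers are hypothetical rather than taken from the paper.

```python
from typing import Sequence


def average_ranks(values: Sequence[float]) -> list[float]:
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the group while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(model_scores: Sequence[float], human_scores: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation computed on the ranks."""
    if len(model_scores) != len(human_scores) or len(model_scores) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    return pearson(average_ranks(model_scores), average_ranks(human_scores))


# Toy data (hypothetical): per-image ratings for one property, e.g. memorability.
human_ratings = [0.9, 0.2, 0.5, 0.7]
model_ratings = [0.8, 0.1, 0.6, 0.4]
alignment = spearman(model_ratings, human_ratings)
```

On this toy data the correlation is 0.8; values near 1 would indicate that the model orders images the way human raters do, and values near 0 would indicate little agreement.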
Abstract
While Multimodal Large Language Models (MLLMs) are adept at answering what is in an image (identifying objects and describing scenes), they often lack the ability to understand how an image feels to a human observer. This gap is most evident when considering subjective cognitive properties, such as what makes an image memorable, funny, aesthetically pleasing, or emotionally evocative. To systematically address this challenge, we introduce CogIP-Bench, a comprehensive benchmark for evaluating MLLMs on such image cognitive properties. Our evaluation reveals a significant gap: current models are poorly aligned with human perception of these nuanced properties. We then demonstrate that a post-training phase can effectively bridge this gap, significantly enhancing the model's alignment with human judgments. Furthermore, we show that this learned cognitive alignment is not merely predictive but…
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Explainable Artificial Intelligence (XAI)
