From Pixels to Feelings: Aligning MLLMs with Human Cognitive Perception of Images

Yiming Chen; Junlin Han; Tianyi Bai; Shengbang Tong; Filippos Kokkinos; Philip Torr

arXiv:2511.22805 · cs.CV · December 1, 2025

TL;DR

This paper introduces CogIP-Bench, a benchmark for evaluating how well multimodal large language models align with human perceptions of images' subjective qualities, and demonstrates a post-training method to improve this alignment for better human-centric AI applications.

Contribution

The paper presents a new benchmark for subjective image properties and a post-training approach that improves model alignment with human perception, and it shows that this alignment transfers to creative image generation tasks.

Findings

01. Current models align poorly with human perception of subjective image qualities.

02. Post-training significantly improves model alignment with human judgments.

03. Aligned models enhance creative image synthesis with desired traits.
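
The findings above describe alignment between model outputs and human judgments, but this summary does not say how that alignment is scored. As an illustration only, the sketch below assumes one common choice, Spearman rank correlation between a model's predicted scores and human ratings for the same images. The function names (`average_ranks`, `spearman`) and the sample numbers are hypothetical and are not taken from the paper or from CogIP-Bench.

```python
from typing import Sequence


def average_ranks(values: Sequence[float]) -> list[float]:
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman(model_scores: Sequence[float], human_ratings: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation computed on the ranks."""
    if len(model_scores) != len(human_ratings) or len(model_scores) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx = average_ranks(model_scores)
    ry = average_ranks(human_ratings)
    mean_x = sum(rx) / len(rx)
    mean_y = sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


if __name__ == "__main__":
    # Hypothetical memorability scores for five images (illustrative values only).
    model = [0.62, 0.35, 0.80, 0.51, 0.44]
    human = [0.70, 0.30, 0.75, 0.40, 0.55]
    print(f"Spearman correlation: {spearman(model, human):.3f}")
```

A rank-based measure is used here because subjective ratings are often more trustworthy in their ordering than in their absolute scale; a benchmark could equally report Pearson correlation or an error metric instead.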

Abstract

While Multimodal Large Language Models (MLLMs) are adept at answering what is in an image (identifying objects and describing scenes), they often lack the ability to understand how an image feels to a human observer. This gap is most evident when considering subjective cognitive properties, such as what makes an image memorable, funny, aesthetically pleasing, or emotionally evocative. To systematically address this challenge, we introduce CogIP-Bench, a comprehensive benchmark for evaluating MLLMs on such image cognitive properties. Our evaluation reveals a significant gap: current models are poorly aligned with human perception of these nuanced properties. We then demonstrate that a post-training phase can effectively bridge this gap, significantly enhancing the model's alignment with human judgments. Furthermore, we show that this learned cognitive alignment is not merely predictive but…

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Explainable Artificial Intelligence (XAI)