Image Aesthetic Reasoning via HCM-GRPO: Empowering Compact Model for Superior Performance
Zhiyuan Hu, Zheng Sun, Yi Wei, Long Yu

TL;DR
This paper introduces a new dataset and a novel training method, HCM-GRPO, to significantly improve image aesthetic reasoning in multimodal large language models, outperforming existing models with fewer resources.
Contribution
The paper presents a comprehensive image screening dataset and a novel HCM-GRPO training framework that enhances image aesthetic reasoning in multimodal models.
Findings
HCM-GRPO outperforms original GRPO in aesthetic reasoning.
State-of-the-art models perform no better than random guessing on this task.
Our approach surpasses larger models with fewer resources.
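For readers unfamiliar with the baseline named in the first finding, the following is a minimal sketch of the group-relative advantage used in standard GRPO: each sampled response's reward is normalized against the mean and standard deviation of its own group. It does not reproduce HCM-GRPO, whose modifications are not described on this page. The function name, the `eps` constant, and the choice of population standard deviation are assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of responses to the same prompt.

    Standard GRPO baseline only; `eps` and the use of the population
    standard deviation are illustrative assumptions.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # Responses above the group mean get a positive advantage,
    # responses below it a negative one.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four responses sampled for one prompt.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

When every response in a group receives the same reward, all advantages are zero and that group contributes no learning signal.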
Abstract
The performance of image generation has improved significantly in recent years. However, image screening is rarely studied, and the performance of Multimodal Large Language Models (MLLMs) on it is unsatisfactory due to the lack of data and their weak image aesthetic reasoning ability. In this work, we propose a complete solution that addresses these problems in terms of both data and methodology. For data, we collect a comprehensive image screening dataset with over 128k samples (about 640k images). Each sample consists of an original image and four generated images. The dataset evaluates image aesthetic reasoning ability across four aspects: appearance deformation, physical shadow, placement layout, and extension rationality. Regarding data annotation, we investigate multiple approaches, including purely manual, fully automated, and answer-driven annotations, to acquire high-quality…
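The sample layout described in the abstract can be sketched as a small record type. This is an illustrative structure only: the class name, field names, and file paths are hypothetical and do not come from the dataset's actual release format.

```python
from dataclasses import dataclass
from typing import List

# The four evaluation aspects listed in the abstract.
ASPECTS = (
    "appearance deformation",
    "physical shadow",
    "placement layout",
    "extension rationality",
)


@dataclass
class ScreeningSample:
    """One sample: an original image plus four generated images (hypothetical schema)."""

    original_image: str
    generated_images: List[str]

    def __post_init__(self):
        # The abstract states that each sample has exactly four generated images.
        if len(self.generated_images) != 4:
            raise ValueError("expected exactly four generated images")

    def image_count(self) -> int:
        return 1 + len(self.generated_images)


# Hypothetical file names, for illustration only.
sample = ScreeningSample("orig.png", ["gen_0.png", "gen_1.png", "gen_2.png", "gen_3.png"])

# Five images per sample is consistent with the stated totals:
# 128k samples x 5 images = 640k images.
approx_total_images = 128_000 * sample.image_count()
```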
Taxonomy
Topics: Visual Attention and Saliency Detection · Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis
