Visual Aesthetic Benchmark: Can Frontier Models Judge Beauty?
Yichen Feng, Yuetai Li, Chunjiang Liu, Yuanyuan Chen, Fengqing Jiang, Yue Huang, Hang Hua, Zhengqing Yuan, Kaiyuan Zheng, Luyao Niu, Bhaskar Ramasubramanian, Basel Alomair, Xiangliang Zhang, Misha Sra, Zichen Chen, Radha Poovendran, Zhangchen Xu

TL;DR
This paper introduces the Visual Aesthetic Benchmark (VAB) to evaluate whether multimodal large language models can accurately judge visual beauty through comparative selection, revealing a significant gap compared to expert human judgment.
Contribution
The paper presents VAB, a new benchmark for aesthetic evaluation based on comparative selection, and evaluates frontier models, highlighting their limitations and the benefits of fine-tuning.
Findings
Current frontier models identify the best and worst images correctly only 26.5% of the time, far below the human performance of 68.9%.
Fine-tuning improves model accuracy, approaching human-level performance.
VAB exposes the gap between AI models and expert aesthetic judgment, providing a new evaluation standard.
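The headline accuracy above comes from a comparative-selection task: the model sees a candidate set and must name the best and the worst image. This page does not spell out the exact scoring rule. The sketch below assumes a strict rule in which a task counts as correct only when both picks match the gold labels. The `Task`, `Prediction`, and `selection_accuracy` names are hypothetical and are not the benchmark's released code.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class Task:
    """Hypothetical VAB-style task: candidate image ids plus gold best/worst labels."""
    candidates: Tuple[str, ...]
    best: str
    worst: str


@dataclass(frozen=True)
class Prediction:
    """A model's best and worst pick for one task."""
    best: str
    worst: str


def selection_accuracy(tasks: Sequence[Task], predictions: Sequence[Prediction]) -> float:
    """Fraction of tasks where BOTH the best and the worst pick match the gold labels.

    Assumption: strict joint scoring; the benchmark may score best and worst differently.
    """
    if len(tasks) != len(predictions):
        raise ValueError("tasks and predictions must align one-to-one")
    if not tasks:
        return 0.0
    correct = 0
    for task, pred in zip(tasks, predictions):
        # A pick outside the candidate set is treated as wrong.
        if pred.best not in task.candidates or pred.worst not in task.candidates:
            continue
        if pred.best == task.best and pred.worst == task.worst:
            correct += 1
    return correct / len(tasks)


tasks = [Task(("a", "b", "c"), best="a", worst="c"),
         Task(("d", "e", "f"), best="e", worst="d")]
preds = [Prediction(best="a", worst="c"),   # both picks right
         Prediction(best="e", worst="f")]   # worst pick wrong
print(selection_accuracy(tasks, preds))
```

Under this strict rule a model earns no credit for getting only one of the two picks right. That makes the metric harder to satisfy than scoring best and worst separately.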
Abstract
Multimodal large language models (MLLMs) are now routinely deployed for visual understanding, generation, and curation. A substantial fraction of these applications require an explicit aesthetic judgment. Most existing solutions reduce this judgment to predicting a scalar score for a single image. We first ask whether such scores faithfully capture comparative preference: in a controlled study with eight expert annotators, score-derived rankings align poorly with the same annotators' direct comparisons, while direct ranking yields substantially higher inter-annotator agreement on best- and worst-image labels. Motivated by this finding, we introduce the Visual Aesthetic Benchmark (VAB), which casts aesthetic evaluation as comparative selection over candidate sets with matched subject matter. VAB contains 400 tasks and 1,195 images across fine art, photography, and illustration, with…
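The abstract's pilot study compares best and worst labels implied by an annotator's per-image scalar scores with that same annotator's direct best/worst choices. The sketch below illustrates the idea under assumed conventions: the highest score marks "best", the lowest marks "worst", ties break by image id, and a candidate set agrees only if both labels match. The function names and the toy scores are hypothetical and do not come from the paper's data.

```python
from typing import Dict, List, Sequence, Tuple


def score_derived_labels(scores: Dict[str, float]) -> Tuple[str, str]:
    """Return the (best, worst) image ids implied by per-image scalar scores.

    Sorts by descending score, breaking ties by image id (an arbitrary choice).
    """
    if len(scores) < 2:
        raise ValueError("need at least two candidates")
    ordered: List[str] = sorted(scores, key=lambda img: (-scores[img], img))
    return ordered[0], ordered[-1]


def score_vs_direct_agreement(
    score_sets: Sequence[Dict[str, float]],
    direct_labels: Sequence[Tuple[str, str]],
) -> float:
    """Fraction of candidate sets where score-derived (best, worst) equals the direct choice."""
    if len(score_sets) != len(direct_labels):
        raise ValueError("score_sets and direct_labels must align one-to-one")
    if not score_sets:
        return 0.0
    matches = sum(
        score_derived_labels(scores) == direct
        for scores, direct in zip(score_sets, direct_labels)
    )
    return matches / len(score_sets)


# One annotator: scores each image in two candidate sets, then also picks (best, worst) directly.
score_sets = [{"a": 8.0, "b": 6.5, "c": 7.0},
              {"d": 5.0, "e": 9.0, "f": 6.0}]
direct = [("a", "b"),   # consistent with the scores
          ("f", "d")]   # scored e highest, but picked f as best when comparing side by side
print(score_vs_direct_agreement(score_sets, direct))
```

The second set shows the kind of inconsistency the abstract reports. An image can receive the top scalar score in isolation and still lose a direct side-by-side comparison.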
