Visual Aesthetic Benchmark: Can Frontier Models Judge Beauty?

Yichen Feng; Yuetai Li; Chunjiang Liu; Yuanyuan Chen; Fengqing Jiang; Yue Huang; Hang Hua; Zhengqing Yuan; Kaiyuan Zheng; Luyao Niu; Bhaskar Ramasubramanian; Basel Alomair; Xiangliang Zhang; Misha Sra; Zichen Chen; Radha Poovendran; Zhangchen Xu

arXiv:2605.12684 · cs.CV · May 14, 2026


TL;DR

This paper introduces the Visual Aesthetic Benchmark (VAB) to evaluate whether multimodal large language models can accurately judge visual beauty through comparative selection, revealing a significant gap compared to expert human judgment.

Contribution

The paper presents VAB, a new benchmark for aesthetic evaluation based on comparative selection, and evaluates frontier models, highlighting their limitations and the benefits of fine-tuning.

Findings

01. Current models correctly identify the best and worst images only 26.5% of the time, well below human performance of 68.9%.

02. Fine-tuning improves model accuracy, approaching human-level performance.

03. VAB exposes the gap between AI models and expert aesthetic judgment, providing a new evaluation standard.

Abstract

Multimodal large language models (MLLMs) are now routinely deployed for visual understanding, generation, and curation. A substantial fraction of these applications require an explicit aesthetic judgment. Most existing solutions reduce this judgment to predicting a scalar score for a single image. We first ask whether such scores faithfully capture comparative preference: in a controlled study with eight expert annotators, score-derived rankings align poorly with the same annotators' direct comparisons, while direct ranking yields substantially higher inter-annotator agreement on best- and worst-image labels. Motivated by this finding, we introduce the Visual Aesthetic Benchmark (VAB), which casts aesthetic evaluation as comparative selection over candidate sets with matched subject matter. VAB contains 400 tasks and 1,195 images across fine art, photography, and illustration, with…
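
The comparative-selection setup described above can be illustrated with a minimal sketch. It assumes each task carries one expert-labeled best image and one worst image, and that a prediction counts as correct only when both picks match. That joint scoring rule, and the names `Task`, `joint_accuracy`, and `picks_from_scores`, are illustrative assumptions and are not taken from the paper or its repository.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """One candidate set with expert labels (hypothetical schema)."""
    task_id: str
    best: str   # image id the experts labeled as best
    worst: str  # image id the experts labeled as worst


def joint_accuracy(tasks, predictions):
    """Fraction of tasks where both the best and worst picks match.

    `predictions` maps task_id -> (best_id, worst_id). A missing
    prediction counts as incorrect. This scoring rule is an assumption.
    """
    if not tasks:
        raise ValueError("no tasks to score")
    hits = sum(
        1 for t in tasks
        if predictions.get(t.task_id) == (t.best, t.worst)
    )
    return hits / len(tasks)


def picks_from_scores(scores):
    """Derive (best, worst) picks from per-image scalar scores.

    Mirrors the score-then-rank approach that the abstract contrasts
    with direct comparison.
    """
    best = max(scores, key=scores.get)
    worst = min(scores, key=scores.get)
    return best, worst


tasks = [Task("t1", "a", "c"), Task("t2", "x", "y")]
predictions = {
    "t1": picks_from_scores({"a": 7.1, "b": 5.0, "c": 2.3}),
    "t2": ("y", "x"),  # best and worst swapped, so this task is a miss
}
accuracy = joint_accuracy(tasks, predictions)
```

Scoring both picks jointly is stricter than scoring either alone, which is one plausible reason a reported accuracy on this kind of task can sit far below single-pick accuracy.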


Code & Models

Repositories

bakelab/Visual-Aesthetic-Benchmark (GitHub)

Models

BakeLab/Kallisti-35B-A3B (Hugging Face model, 53 downloads, 2 likes)

Datasets

BakeLab/Visual-Aesthetic-Benchmark (dataset, 79 downloads)
