ArtiMuse: Fine-Grained Image Aesthetics Assessment with Joint Scoring and Expert-Level Understanding
Shuo Cao, Nan Ma, Jiayang Li, Xiaohui Li, Lihao Shao, Kaiwen Zhu, Yu Zhou, Yuandong Pu, Jiarui Wu, Jiaquan Wang, Bo Qu, Wenhai Wang, Yu Qiao, Dajuin Yao, Yihao Liu

TL;DR
ArtiMuse is a novel multimodal large language model for fine-grained image aesthetics assessment, providing both quantitative scores and expert-level attribute understanding, supported by a new expert-curated dataset.
Contribution
The paper introduces ArtiMuse, a joint scoring and understanding model, and ArtiMuse-10K, the first expert-annotated dataset for detailed aesthetic evaluation.
Findings
ArtiMuse outperforms traditional methods in aesthetic assessment.
The dataset enables fine-grained attribute analysis.
Model and dataset will be publicly available.
Abstract
The rapid advancement of educational applications, artistic creation, and AI-generated content (AIGC) technologies has substantially increased practical requirements for comprehensive Image Aesthetics Assessment (IAA), particularly demanding methods capable of delivering both quantitative scoring and professional understanding. Multimodal Large Language Model (MLLM)-based IAA methods demonstrate stronger perceptual and generalization capabilities compared to traditional approaches, yet they suffer from modality bias (score-only or text-only) and lack fine-grained attribute decomposition, thereby failing to support further aesthetic assessment. In this paper, we present: (1) ArtiMuse, an innovative MLLM-based IAA model with Joint Scoring and Expert-Level Understanding capabilities; (2) ArtiMuse-10K, the first expert-curated image aesthetic dataset comprising 10,000 images spanning 5 main…
Peer Reviews
Decision · ICLR 2026 Conference Withdrawn Submission
The paper's problem definition is clear and highly relevant, addressing the growing demand for interpretable and quantitative Image Aesthetics Assessment (IAA) in fields like education, content creation, and AIGC quality control. The newly introduced ArtiMuse-10K dataset is a valuable contribution, offering superior coverage and annotation granularity with its five main categories, 15 subcategories, and eight expert-annotated aesthetic attributes, which is a significant improvement over existing
Insufficient evaluation rigor: The core claim is expert-level textual analysis, yet the assessment leans heavily on another LLM, Gemini-2.0-flash, as the judge. This setup invites evaluation bias, and the paper does not report how closely the LLM’s judgments align with human experts. Without agreement statistics or confidence intervals, the validity of the text-quality claims remains uncertain. The dataset story is also thin. There is no clear annotation protocol, no inter-rater reliability metri
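The agreement statistics this review asks for could be computed along the following lines. This is a minimal sketch, not anything reported in the paper: it assumes each text sample has one numeric rating from the LLM judge and one from a human expert, and it uses Spearman rank correlation with a percentile bootstrap interval as one reasonable choice of statistic. The function names and the sample ratings are hypothetical.

```python
import random


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; NaN if either input has zero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def bootstrap_ci(x, y, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for Spearman's rho over paired samples."""
    rng = random.Random(seed)
    n = len(x)
    stats = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        rho = spearman([x[i] for i in idx], [y[i] for i in idx])
        if rho == rho:  # drop NaN from degenerate resamples
            stats.append(rho)
    stats.sort()
    lo = stats[int((alpha / 2) * len(stats))]
    hi = stats[min(len(stats) - 1, int((1 - alpha / 2) * len(stats)))]
    return lo, hi


# Hypothetical ratings, for illustration only (not data from the paper).
llm_judge = [7.0, 5.5, 8.0, 6.0, 4.0, 9.0, 6.5, 3.0]
human_expert = [6.5, 5.0, 8.5, 6.5, 4.5, 8.0, 7.0, 3.5]
rho = spearman(llm_judge, human_expert)
ci_low, ci_high = bootstrap_ci(llm_judge, human_expert)
```

Rank correlation is used here because judge and expert scales need not be calibrated to each other; for categorical judgments, a chance-corrected agreement measure would be the more natural choice.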
1. This paper introduces a large-scale, expert-labeled dataset that provides valuable resources for advancing research in aesthetic image understanding. 2. The proposed *Token As Score* strategy is simple yet effective, enabling the model to handle continuous aesthetic scoring naturally. 3. Extensive experiments across multiple datasets convincingly demonstrate the effectiveness and robustness of ArtiMuse.
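To make the idea of predicting a continuous score through tokens concrete, here is a generic sketch of one way such a scheme can work. It is not the paper's implementation: the exact token vocabulary and mapping used by ArtiMuse are not described on this page. The sketch assumes the score range is split into equal-width bins, each bin has a dedicated score token, and the continuous score is read out as the probability-weighted mean of the bin centers. The 0 to 10 range, the bin count, and all function names are assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def bin_centers(num_bins, lo=0.0, hi=10.0):
    """Centers of num_bins equal-width bins covering [lo, hi]."""
    width = (hi - lo) / num_bins
    return [lo + (i + 0.5) * width for i in range(num_bins)]


def score_to_bin(score, num_bins, lo=0.0, hi=10.0):
    """Training side: index of the score token that a ground-truth score maps to."""
    width = (hi - lo) / num_bins
    idx = int((score - lo) / width)
    return min(max(idx, 0), num_bins - 1)  # clamp so score == hi lands in the last bin


def expected_score(score_token_logits, lo=0.0, hi=10.0):
    """Inference side: probability-weighted mean of bin centers.

    score_token_logits holds the model's logits restricted to the score
    tokens, ordered from the lowest bin to the highest.
    """
    probs = softmax(score_token_logits)
    centers = bin_centers(len(probs), lo, hi)
    return sum(p * c for p, c in zip(probs, centers))


# Four hypothetical score tokens; a uniform distribution decodes to the midpoint.
uniform = expected_score([0.0, 0.0, 0.0, 0.0])
# A sharply peaked distribution decodes close to the center of its bin (6.25).
peaked = expected_score([0.0, 0.0, 50.0, 0.0])
```

Decoding by expectation rather than by argmax lets the output vary smoothly between bin centers, which is one common reason to treat score prediction as classification over many ordered tokens.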
1. Incomplete review of related work. For example, AesExpert [1] provides a dataset of 21,904 images across three categories with multiple annotated aesthetic attributes. [1] Huang Y, Sheng X, Yang Z, et al. AesExpert: Towards multi-modality foundation model for image aesthetics perception[C]//Proceedings of the 32nd ACM International Conference on Multimedia. 2024: 5911-5920. 2. The rationale for the token selection strategy needs further explanation. The paper proposes using double-letter comb
1. It introduces an expert-annotated, multidimensional aesthetic dataset (ArtiMuse-10K). 2. The model, ArtiMuse, provides both holistic scores and fine-grained attribute analysis. 3. The model enhances interpretability and professional-level understanding. 4. The model addresses modality bias of prior score-only or text-only models.
1. One of the most puzzling aspects of this work is that it mixes different types of images into a single dataset. It remains unclear whether the learned “aesthetic understanding” truly reflects aesthetic principles or merely fits the dataset distribution. For instance, are the scoring standards for Children’s Paintings and Chinese Paintings the same? Although the model performs well during inference without explicit category labels, an additional experiment involving category classification wou
1. The expert-annotated dataset (ArtiMuse-10K), featuring professional annotations across many fine-grained attributes (e.g., composition, technical execution, creativity), addresses the coarse granularity and lack of expert guidance in existing IAA datasets. 2. The ArtiMuse model successfully integrates quantitative scoring with qualitative interpretation. 3. The Token As Score strategy resolves the inherent limitation of MLLMs in performing continuous score prediction by densely m
1. Data Limitations and Distribution: Is a quantity of 10k sufficient, considering the high-dimensional nature of aesthetic assessment? The paper needs to clarify whether the data distribution across different categories is uniform or long-tailed, and address the potential risks of data bias. 2. Limited Potential for Foundation Model Enhancement: The fine-tuning process is focused solely on the score prediction task. The paper does not explore or demonstrate the capacity of this task to RL and posi
1. The paper is clearly and coherently written. 2. Its proposed ArtiMuse-10K dataset is meaningful for the IAA field.
1. In Section 4.2 (lines 273-276), the paper uses a score-guided approach to generate aesthetic captions, which has been used in UniQA [1]. The paper does not properly cite this research. 2. Experiments (a), (b), and (c) should discuss the impact of data size on the results. Images with score annotations may be numerous, so they have a greater impact on the results. 3. I would like to know whether the first-stage text training is effective for other methods in the second stage (Text As Score,
Code & Models
- 🤗 Thunderbolt215215/ArtiMuse · model · 1.5k dl · ♡ 8
- 🤗 Thunderbolt215215/ArtiMuse_AVA · model · 6 dl
- 🤗 Thunderbolt215215/ArtiMuse_FLICKR-AES · model · 3 dl
- 🤗 Thunderbolt215215/ArtiMuse_PARA · model · 1 dl
- 🤗 Thunderbolt215215/ArtiMuse_TAD66K · model · 3 dl
- 🤗 Thunderbolt215215/UniPercept · model · 1.2k dl · ♡ 10
Taxonomy
Topics: Aesthetic Perception and Analysis · Visual Attention and Saliency Detection · Ethics, Aesthetics, and Art
