Bridging Cognitive Gap: Hierarchical Description Learning for Artistic Image Aesthetics Assessment
Henglin Liu, Nisha Huang, Chang Liu, Jiangpeng Yan, Huijuan Huang, Jixuan Ying, Tong-Yee Lee, Pengfei Wan, Xiangyang Ji

TL;DR
This paper introduces RAD, a large-scale multi-dimensional aesthetic description dataset, and ArtQuant, an assessment framework that combines joint description generation with large language models to improve artistic image aesthetics assessment while cutting training epochs by 67%.
Contribution
The paper presents RAD, a scalable multi-dimensional aesthetic description dataset, and ArtQuant, an assessment framework that integrates description generation with LLMs.
Findings
Achieved state-of-the-art performance on multiple datasets.
Reduced training epochs by 67%, enhancing efficiency.
Provided a scalable, multi-dimensional aesthetic dataset.
Abstract
The aesthetic quality assessment task is crucial for developing a human-aligned quantitative evaluation system for AIGC. However, its inherently complex nature, spanning visual perception, cognition, and emotion, poses fundamental challenges. Although aesthetic descriptions offer a viable representation of this complexity, two critical challenges persist: (1) data scarcity and imbalance: existing datasets focus heavily on visual perception and neglect deeper dimensions because manual annotation is expensive; and (2) model fragmentation: current visual networks isolate aesthetic attributes with multi-branch encoders, while multimodal methods, represented by contrastive learning, struggle to process long-form textual descriptions effectively. To resolve challenge (1), we first present the Refined Aesthetic Description (RAD) dataset, a large-scale (70k), multi-dimensional structured dataset,…
Taxonomy
Topics: Aesthetic Perception and Analysis · Visual Attention and Saliency Detection · Generative Adversarial Networks and Image Synthesis
