Multimodal LLMs Can Reason about Aesthetics in Zero-Shot
Ruixiang Jiang, Changwen Chen

TL;DR
This paper explores how multimodal large language models can be prompted to perform aesthetic judgments in art, revealing their reasoning process and addressing hallucinations to better align with human aesthetic understanding.
Contribution
The paper introduces ArtCoT, an evidence-based prompting method that strengthens MLLMs' aesthetic reasoning, reduces hallucinations, and improves alignment with human judgments.
Findings
MLLMs can perform aesthetic reasoning with proper prompting.
Hallucinations in MLLMs can be mitigated through evidence-based prompts.
Enhanced reasoning aligns better with human aesthetic judgments.
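To make the idea of evidence-based prompting concrete, here is a minimal sketch of a two-stage "observe first, judge second" prompt flow. It is an illustration only, not ArtCoT's actual prompts or pipeline, which this summary does not spell out. The prompt wording, the function names, and the `ask` callable (a stand-in for whatever MLLM client you use) are all assumptions.

```python
from typing import Callable

# Hypothetical prompt wording, written for illustration; not taken from the paper.
EVIDENCE_PROMPT = (
    "Describe only what is directly observable in the image: "
    "colors, composition, brushwork, subject matter. "
    "Do not give opinions or interpretations."
)

JUDGMENT_TEMPLATE = (
    "Using only the observations below, assess the aesthetic quality "
    "of the artwork and cite the observation supporting each claim.\n\n"
    "Observations:\n{evidence}"
)


def build_judgment_prompt(evidence: str) -> str:
    """Embed the stage-one observations into the stage-two judgment prompt."""
    return JUDGMENT_TEMPLATE.format(evidence=evidence.strip())


def evidence_first_judgment(image: object, ask: Callable[[object, str], str]) -> str:
    """Run two model calls: collect observable evidence, then judge from it.

    `ask(image, prompt)` is a hypothetical wrapper around an MLLM client
    that returns the model's text response.
    """
    evidence = ask(image, EVIDENCE_PROMPT)
    return ask(image, build_judgment_prompt(evidence))
```

The design choice here is to separate description from evaluation, so that the second call can only lean on statements the first call produced. Whether this particular split matches the paper's method is not confirmed by the summary above.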
Abstract
The rapid technical progress of generative art (GenArt) has democratized the creation of visually appealing imagery. However, achieving genuine artistic impact - the kind that resonates with viewers on a deeper, more meaningful level - remains formidable as it requires a sophisticated aesthetic sensibility. This sensibility involves a multifaceted cognitive process extending beyond mere visual appeal, which is often overlooked by current computational methods. This paper pioneers an approach to capture this complex process by investigating how the reasoning capabilities of Multimodal LLMs (MLLMs) can be effectively elicited to perform aesthetic judgment. Our analysis reveals a critical challenge: MLLMs exhibit a tendency towards hallucinations during aesthetic reasoning, characterized by subjective opinions and unsubstantiated artistic interpretations. We further demonstrate that these…
Taxonomy
Methods: ALIGN
