Venus: Benchmarking and Empowering Multimodal Large Language Models for Aesthetic Guidance and Cropping
Tianxiang Du, Hulingxiao He, Yuxin Peng

TL;DR
Venus introduces a new framework and dataset for enhancing multimodal large language models with aesthetic guidance and cropping abilities, enabling more professional and interactive photo refinement.
Contribution
We present AesGuide, the first large-scale aesthetic guidance (AG) dataset and benchmark, and Venus, a two-stage framework that significantly improves the AG and aesthetic cropping abilities of multimodal large language models.
Findings
Venus achieves state-of-the-art aesthetic cropping performance.
The framework enables interpretable and interactive aesthetic refinement.
Extensive experiments validate the effectiveness of Venus.
Abstract
The widespread use of smartphones has made photography ubiquitous, yet a clear gap remains between ordinary users and professional photographers, who can identify aesthetic issues and provide actionable shooting guidance during capture. We define this capability as aesthetic guidance (AG) -- an essential but largely underexplored domain in computational aesthetics. Existing multimodal large language models (MLLMs) primarily offer overly positive feedback, failing to identify issues or provide actionable guidance. Without AG capability, they cannot effectively identify distracting regions or optimize compositional balance, thus also struggling in aesthetic cropping, which aims to refine photo composition through reframing after capture. To address this, we introduce AesGuide, the first large-scale AG dataset and benchmark with 10,748 photos annotated with aesthetic scores, analyses, and…
Taxonomy
Topics: Visual Attention and Saliency Detection · Aesthetic Perception and Analysis · Generative Adversarial Networks and Image Synthesis
