CROP: Expert-Aligned Image Cropping via Compositional Reasoning and Optimizing Preference
Zhitong Dong, Chao Li, Jie Yu, Hao Chen

TL;DR
This paper introduces CROP, an aesthetic image cropping method that combines multimodal reasoning about scene composition with expert preference alignment, so that its crops more closely match those of human experts.
Contribution
The paper reformulates aesthetic cropping as a multimodal reasoning task, so that a VLM analyzes and reasons about scene composition and its crops are aligned with human expert preferences.
Findings
CROP outperforms existing methods on multiple datasets.
The reasoning-based approach improves compositional trade-offs.
Expert preference alignment enhances aesthetic quality of crops.
Abstract
Aesthetic image cropping aims to enhance the aesthetic quality of an image by improving its composition through spatial cropping. Previous methods often rely on saliency prediction or retrieval augmentation, ignoring the task's core requirement: a deep understanding of composition and aesthetics. Consequently, saliency-based methods struggle to make compositional trade-offs in complex scenes, while retrieval-based methods blindly refer to similar cases and lack adaptive reasoning for unique scenes. Both approaches fail to align their automated cropping results with those of human experts. To address these issues, we propose a novel paradigm that reformulates aesthetic cropping as a multimodal reasoning task, aiming to activate the VLM's analytical and comprehension capabilities in aesthetics. We design a Compositional Reasoning and Optimizing Preference method (CROP) that directs…
