CROP: Expert-Aligned Image Cropping via Compositional Reasoning and Optimizing Preference

Zhitong Dong; Chao Li; Jie Yu; Hao Chen

arXiv:2605.12545·cs.CV·May 14, 2026

TL;DR

This paper introduces CROP, an aesthetic image cropping method that pairs multimodal reasoning about scene composition with expert preference alignment, so that its crops more closely match those chosen by human experts.

Contribution

The paper reformulates aesthetic cropping as a multimodal reasoning task, so that a vision-language model (VLM) reasons explicitly about scene composition and its crops are aligned with the preferences of human experts.

Findings

01. CROP outperforms existing methods on multiple datasets.

02. The reasoning-based approach improves how compositional trade-offs are handled in complex scenes.

03. Expert preference alignment enhances the aesthetic quality of the crops.

Abstract

Aesthetic image cropping aims to enhance the aesthetic quality of an image by improving its composition through spatial cropping. Previous methods often rely on saliency prediction or retrieval augmentation, ignoring the task's core requirement: a deep understanding of composition and aesthetics. Consequently, saliency-based methods struggle to make compositional trade-offs in complex scenes, while retrieval-based methods blindly refer to similar cases and lack adaptive reasoning for unique scenes. Both approaches fail to align their automated cropping results with those of human experts. To address these issues, we propose a novel paradigm that reformulates aesthetic cropping as a multimodal reasoning task, aiming to activate the VLM's analytical and comprehension capabilities in aesthetics. We design a Compositional Reasoning and Optimizing Preference method (CROP) that directs…
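
The abstract frames success as agreement between an automated crop and a human expert's crop. One common way to put a number on that agreement is intersection-over-union (IoU) between the two crop rectangles. The sketch below is only an illustration under that assumption: the abstract does not state which metric CROP is evaluated with, and the `Box` type, the `iou` function, and the sample coordinates are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned crop rectangle in pixel coordinates (hypothetical type)."""
    x1: float  # left edge
    y1: float  # top edge
    x2: float  # right edge
    y2: float  # bottom edge

    def area(self) -> float:
        # Clamp at zero so an inverted (empty) rectangle has no area.
        return max(0.0, self.x2 - self.x1) * max(0.0, self.y2 - self.y1)


def iou(predicted: Box, expert: Box) -> float:
    """Intersection-over-union between a predicted crop and an expert crop."""
    # The overlap region is bounded by the innermost edges of the two boxes.
    overlap = Box(
        max(predicted.x1, expert.x1),
        max(predicted.y1, expert.y1),
        min(predicted.x2, expert.x2),
        min(predicted.y2, expert.y2),
    ).area()
    union = predicted.area() + expert.area() - overlap
    return overlap / union if union > 0 else 0.0


if __name__ == "__main__":
    # Illustrative coordinates only, not taken from the paper.
    model_crop = Box(0, 0, 10, 10)
    expert_crop = Box(5, 0, 15, 10)
    print(f"IoU = {iou(model_crop, expert_crop):.3f}")
```

An IoU of 1.0 means the two crops coincide exactly and 0.0 means they do not overlap at all, so a higher score indicates closer agreement with the expert.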
