PhotoFramer: Multi-modal Image Composition Instruction

Zhiyuan You; Ke Wang; He Zhang; Xin Cai; Jinjin Gu; Tianfan Xue; Chao Dong; Zhoutong Zhang

arXiv:2512.00993 · cs.CV · April 22, 2026

TL;DR

PhotoFramer is a multi-modal framework that guides users in improving photo composition by describing adjustments in natural language and generating well-composed example images.

Contribution

It introduces a novel dataset and model for jointly generating textual guidance and illustrative images to assist users in composing better photographs.

Findings

01. Textual instructions effectively improve image composition.

02. Coupling text guidance with example images yields better results than using examples alone.

03. The framework demonstrates practical potential for accessible photographic assistance.

Abstract

Composition matters during the photo-taking process, yet many casual users struggle to frame well-composed images. To provide composition guidance, we introduce PhotoFramer, a multi-modal composition instruction framework. Given a poorly composed image, PhotoFramer first describes how to improve the composition in natural language and then generates a well-composed example image. To train such a model, we curate a large-scale dataset. Inspired by how humans take photos, we organize composition guidance into a hierarchy of sub-tasks: shift, zoom-in, and view-change tasks. Shift and zoom-in data are sampled from existing cropping datasets, while view-change data are obtained via a two-stage pipeline. First, we sample pairs with varying viewpoints from multi-view datasets, and train a degradation model to transform well-composed photos into poorly composed ones. Second, we apply this…
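
The abstract says shift and zoom-in data are sampled from existing cropping datasets but does not give the sampling procedure here. The sketch below is one plausible reading, not the authors' method: it assumes a cropping dataset supplies a well-composed crop box inside a larger image, and derives a poorly composed view by translating that box (shift) or widening it (zoom-in), together with a text hint for getting back to the good crop. The names `Box`, `describe_shift`, `shift_pair`, and `zoom_in_pair`, the pixel-based hints, and the clamping rule are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned crop in pixel coordinates; y grows downward."""
    x0: int
    y0: int
    x1: int
    y1: int


def describe_shift(dx: int, dy: int) -> str:
    """Turn the offset from the poor view to the good view into a text hint."""
    parts = []
    if dx:
        parts.append(f"{'right' if dx > 0 else 'left'} by {abs(dx)} px")
    if dy:
        parts.append(f"{'down' if dy > 0 else 'up'} by {abs(dy)} px")
    if not parts:
        return "keep the current framing"
    return "shift the frame " + " and ".join(parts)


def shift_pair(good: Box, dx: int, dy: int, img_w: int, img_h: int):
    """Assumed shift task: move the good crop off target, keeping its size."""
    w, h = good.x1 - good.x0, good.y1 - good.y0
    # Translate, then slide the box back inside the image bounds.
    x0 = min(max(good.x0 + dx, 0), img_w - w)
    y0 = min(max(good.y0 + dy, 0), img_h - h)
    poor = Box(x0, y0, x0 + w, y0 + h)
    # The hint is the move that takes the poor view back to the good one.
    return poor, describe_shift(good.x0 - poor.x0, good.y0 - poor.y0)


def zoom_in_pair(good: Box, scale: float, img_w: int, img_h: int):
    """Assumed zoom-in task: the poor view is a wider crop around the good one."""
    if scale <= 1:
        raise ValueError("scale must be > 1 so the poor view is wider")
    cx, cy = (good.x0 + good.x1) / 2, (good.y0 + good.y1) / 2
    hw = (good.x1 - good.x0) * scale / 2
    hh = (good.y1 - good.y0) * scale / 2
    # Widen around the centre, clipping to the image borders.
    poor = Box(max(0, round(cx - hw)), max(0, round(cy - hh)),
               min(img_w, round(cx + hw)), min(img_h, round(cy + hh)))
    return poor, "zoom in on the subject"


good_crop = Box(100, 50, 500, 350)  # hypothetical annotation in a 640x480 image
print(shift_pair(good_crop, 200, 0, 640, 480))
print(zoom_in_pair(good_crop, 1.5, 640, 480))
```

Each call returns a (poor view, hint) pair; cropping the source image with the poor box and the good box would give the input and target images. The view-change task cannot be produced this way, since it needs a different viewpoint rather than a different crop.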

Code & Models

Models

zhiyuanyou/Qwen2.5-VL-7B-GRPO-Composition-Score-Class
model · 63 dl

Datasets

zhiyuanyou/Datasets-PhotoFramer-Assessment
dataset · 58 dl
