MonetGPT: Solving Puzzles Enhances MLLMs' Image Retouching Skills

Niladri Shekhar Dutt; Duygu Ceylan; Niloy J. Mitra

arXiv:2505.06176·cs.GR·May 12, 2025

TL;DR

MonetGPT trains multimodal large language models to critique, plan, and execute image retouching operations, improving explainability and object preservation in photo editing tasks.

Contribution

The paper introduces a method for teaching MLLMs to understand and perform procedural image retouching by training them on a reasoning dataset synthesized from expert edits.

Findings

01. MLLMs can effectively critique photographs and suggest retouching operations.

02. The approach preserves object identity and details.

03. It outperforms existing generative and procedural methods in explainability.

Abstract

Retouching is an essential task in post-manipulation of raw photographs. Generative editing, guided by text or strokes, provides a new tool accessible to users but can easily change the identity of the original objects in unacceptable and unpredictable ways. In contrast, although traditional procedural edits, as commonly supported by photo-editing tools (e.g., GIMP, Lightroom), are conservative, they are still preferred by professionals. Unfortunately, professional-quality retouching involves many individual procedural editing operations that are challenging for most novices to plan. In this paper, we ask if a multimodal large language model (MLLM) can be taught to critique raw photographs, suggest suitable remedies, and finally realize them with a given set of pre-authored procedural image operations. We demonstrate that MLLMs can be first made aware of the underlying image processing…
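The "realize them with a given set of pre-authored procedural image operations" step can be pictured as executing a plan, that is, an ordered list of named operations with parameters, against a fixed operation library. The following is a minimal sketch under stated assumptions, not the authors' implementation: the operation names (`exposure`, `contrast`), their formulas, the plan format, and the `apply_plan` helper are all hypothetical, and the image is a tiny grayscale grid of floats in [0, 1] to keep the example dependency-free.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical image type: rows of grayscale pixel values in [0, 1].
Image = List[List[float]]


def _clamp(x: float) -> float:
    """Keep a pixel value inside the valid [0, 1] range."""
    return max(0.0, min(1.0, x))


def adjust_exposure(img: Image, amount: float) -> Image:
    """Scale brightness by 2**amount (amount measured in stops)."""
    gain = 2.0 ** amount
    return [[_clamp(p * gain) for p in row] for row in img]


def adjust_contrast(img: Image, amount: float) -> Image:
    """Stretch pixel values around mid-gray by a factor of (1 + amount)."""
    factor = 1.0 + amount
    return [[_clamp(0.5 + (p - 0.5) * factor) for p in row] for row in img]


# Assumed operation library; a real tool would expose many more operations.
OPERATIONS: Dict[str, Callable[[Image, float], Image]] = {
    "exposure": adjust_exposure,
    "contrast": adjust_contrast,
}


def apply_plan(img: Image, plan: List[Tuple[str, float]]) -> Image:
    """Apply each (operation, amount) step in order; reject unknown operations."""
    for name, amount in plan:
        if name not in OPERATIONS:
            raise KeyError(f"unknown operation: {name}")
        img = OPERATIONS[name](img, amount)
    return img


# A plan such as a model might emit after critiquing an underexposed photo.
plan = [("exposure", 1.0), ("contrast", 0.5)]
result = apply_plan([[0.25, 0.5]], plan)
```

Because every step is a named, parameterized operation rather than a regenerated image, the plan itself serves as a readable explanation of the edit, and pixels are only remapped rather than resynthesized.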

Code & Models

Hugging Face model: niladridutt/monetGPT (20 downloads, 1 like)

Taxonomy

Methods: Attentive Walk-Aggregating Graph Neural Network · Sparse Evolutionary Training