TL;DR
MonetGPT trains multimodal large language models to critique, plan, and execute image retouching operations, improving explainability and object preservation in photo editing tasks.
Contribution
It introduces a method to teach MLLMs to understand and perform procedural image retouching by training on a synthesized reasoning dataset from expert edits.
Findings
MLLMs can effectively critique photographs and suggest suitable retouching operations.
The approach preserves object identity and details.
It outperforms existing generative and procedural methods in explainability.
Abstract
Retouching is an essential task in post-manipulation of raw photographs. Generative editing, guided by text or strokes, provides a new tool accessible to users but can easily change the identity of the original objects in unacceptable and unpredictable ways. In contrast, although traditional procedural edits, as commonly supported by photo-editing tools (e.g., GIMP, Lightroom), are conservative, they are still preferred by professionals. Unfortunately, professional-quality retouching involves many individual procedural editing operations that are challenging for most novices to plan. In this paper, we ask if a multimodal large language model (MLLM) can be taught to critique raw photographs, suggest suitable remedies, and finally realize them with a given set of pre-authored procedural image operations. We demonstrate that MLLMs can be first made aware of the underlying image processing…
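The "plan, then realize with pre-authored procedural operations" idea in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the operation names (`exposure`, `contrast`, `saturation`), their formulas, the plan format (a list of steps with `op`, `value`, and `reason` fields), and the helper `apply_plan` are all hypothetical. The plan is hard-coded where an MLLM's output would normally go, and images are nested lists of RGB tuples in [0, 1] so the example needs only the standard library.

```python
from typing import Callable, Dict, List, Tuple

Pixel = Tuple[float, ...]
Image = List[List[Pixel]]


def _clamp(x: float) -> float:
    """Keep a channel value inside the displayable range [0, 1]."""
    return max(0.0, min(1.0, x))


def _map_pixels(img: Image, fn: Callable[[Pixel], Pixel]) -> Image:
    """Apply a per-pixel function and clamp every channel."""
    return [[tuple(_clamp(c) for c in fn(p)) for p in row] for row in img]


def exposure(img: Image, stops: float) -> Image:
    """Hypothetical operation: scale brightness by 2**stops."""
    gain = 2.0 ** stops
    return _map_pixels(img, lambda p: tuple(c * gain for c in p))


def contrast(img: Image, amount: float) -> Image:
    """Hypothetical operation: stretch channels around mid-gray (0.5)."""
    k = 1.0 + amount
    return _map_pixels(img, lambda p: tuple(0.5 + (c - 0.5) * k for c in p))


def saturation(img: Image, amount: float) -> Image:
    """Hypothetical operation: move channels toward or away from luminance."""
    k = 1.0 + amount

    def fn(p: Pixel) -> Pixel:
        luma = 0.2126 * p[0] + 0.7152 * p[1] + 0.0722 * p[2]
        return tuple(luma + (c - luma) * k for c in p)

    return _map_pixels(img, fn)


# The fixed, pre-authored operation library that a plan may refer to.
OPERATIONS: Dict[str, Callable[[Image, float], Image]] = {
    "exposure": exposure,
    "contrast": contrast,
    "saturation": saturation,
}


def apply_plan(img: Image, plan: List[dict]) -> Image:
    """Execute plan steps in order; reject operations outside the library."""
    for step in plan:
        op = OPERATIONS.get(step["op"])
        if op is None:
            raise ValueError(f"unknown operation: {step['op']}")
        img = op(img, float(step["value"]))
    return img


# Stand-in for a model-produced plan; each step carries its own rationale.
plan = [
    {"op": "exposure", "value": 1.0, "reason": "image is underexposed"},
    {"op": "contrast", "value": 0.5, "reason": "tones look flat"},
]
retouched = apply_plan([[(0.25, 0.25, 0.25)]], plan)
```

Because each step only adjusts pixel values through a named, parameterized operation, the output cannot introduce new objects, and the `reason` field keeps every adjustment traceable to a critique. That is the property the summary above describes as explainability and object preservation.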
Taxonomy
Methods: Attentive Walk-Aggregating Graph Neural Network · Sparse Evolutionary Training
