JarvisEvo: Towards a Self-Evolving Photo Editing Agent with Synergistic Editor-Evaluator Optimization

Yunlong Lin; Linqing Wang; Kunjie Lin; Zixu Lin; Kaixiong Gong; Wenbo Li; Bin Lin; Zhenxi Li; Shiyi Zhang; Yuyang Peng; Wenxun Dai; Xinghao Ding; Chunyu Wang; Qinglin Lu

arXiv:2511.23002·cs.CV·December 5, 2025

TL;DR

JarvisEvo introduces a self-evolving photo editing agent that combines multimodal reasoning and a synergistic policy framework to improve editing accuracy and reduce reward hacking, outperforming existing models on key metrics.

Contribution

The paper presents a novel unified agent with interleaved multimodal reasoning and self-improvement capabilities, advancing photo editing AI beyond prior instruction-following models.

Findings

01. Outperforms Nano-Banana by 18.95% on editing metrics

02. Achieves a 44.96% improvement in pixel-level content fidelity

03. Demonstrates effective self-improvement without external rewards

Abstract

Agent-based editing models have substantially advanced interactive experiences, processing quality, and creative flexibility. However, two critical challenges persist: (1) instruction hallucination, where text-only chain-of-thought (CoT) reasoning cannot fully prevent factual errors because of inherent information bottlenecks; and (2) reward hacking, where dynamic policy optimization against static reward models allows agents to exploit flaws in reward functions. To address these issues, we propose JarvisEvo, a unified image editing agent that emulates an expert human designer by iteratively editing, selecting appropriate tools, evaluating results, and reflecting on its own decisions to refine outcomes. JarvisEvo offers three key advantages: (1) an interleaved multimodal chain-of-thought (iMCoT) reasoning mechanism that enhances instruction following and editing quality; (2) a synergistic editor-evaluator…
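The edit, evaluate, and reflect cycle described in the abstract can be illustrated with a toy loop. This is only a minimal sketch under loud assumptions, not the paper's method: the "image" is a single brightness value, and `TOOLS`, `evaluate`, `select_tool`, and `edit_loop` are hypothetical names invented for illustration. The real agent reasons over actual images with learned editor and evaluator policies.

```python
from typing import Callable, Dict, List, Tuple

# Toy stand-in for an image: one brightness value in [0, 1] (assumption).
Image = float

# Hypothetical tool set; a real agent would call actual editing tools.
TOOLS: Dict[str, Callable[[Image], Image]] = {
    "brighten": lambda img: min(1.0, img + 0.1),
    "darken": lambda img: max(0.0, img - 0.1),
}


def evaluate(img: Image, target: Image) -> float:
    """Hypothetical evaluator: higher is better, 1.0 is a perfect match."""
    return 1.0 - abs(img - target)


def select_tool(img: Image, target: Image) -> str:
    """Hypothetical tool selection: pick the tool that moves toward the target."""
    return "brighten" if img < target else "darken"


def edit_loop(
    img: Image, target: Image, max_steps: int = 10, tol: float = 0.05
) -> Tuple[Image, List[str]]:
    """Iteratively edit, evaluate, and reflect until the result is good enough."""
    history: List[str] = []
    score = evaluate(img, target)
    for _ in range(max_steps):
        if 1.0 - score <= tol:
            break  # the evaluator is satisfied
        tool = select_tool(img, target)
        candidate = TOOLS[tool](img)
        new_score = evaluate(candidate, target)
        # Reflect: keep the edit only if the evaluator reports an improvement.
        if new_score > score:
            img, score = candidate, new_score
            history.append(tool)
        else:
            break  # discard the unhelpful edit and stop
    return img, history


if __name__ == "__main__":
    final_img, steps = edit_loop(0.2, 0.6)
    print(steps, round(final_img, 3))
```

In this sketch the evaluator is a fixed function, which is exactly the static-reward setting the abstract identifies as vulnerable to reward hacking; the paper's synergistic framework instead optimizes the editor and the evaluator together.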

Code & Models

Models

JarvisEvo/JarvisEvo
model · 934 downloads · 6 likes

Datasets

JarvisEvo/ArtEdit-Bench
dataset · 35 downloads

Taxonomy

Topics: Multimodal Machine Learning Applications · Digital Humanities and Scholarship · Innovative Human-Technology Interaction