PerTouch: VLM-Driven Agent for Personalized and Semantic Image Retouching

Zewei Chang; Zheng-Peng Duan; Jianxing Zhang; Chun-Le Guo; Siyu Liu; Hyungju Chun; Hyunhee Park; Zikun Liu; Chongyi Li

arXiv:2511.12998 · cs.CV · December 19, 2025

TL;DR

PerTouch is a diffusion-based framework for personalized, semantic-aware image retouching. It combines a vision-language model (VLM) agent, training mechanisms that improve semantic boundary perception, and user feedback to give finer control and closer alignment with each user's aesthetic preferences.

Contribution

It introduces a unified diffusion model with semantic control, a VLM-driven agent for natural language interaction, and feedback mechanisms for personalized image retouching.

Findings

1. Outperforms existing methods in personalized retouching quality
2. Effective semantic boundary perception improves retouching precision
3. VLM-driven agent successfully interprets user instructions

Abstract

Image retouching aims to enhance visual quality while aligning with users' personalized aesthetic preferences. To address the challenge of balancing controllability and subjectivity, we propose a unified diffusion-based image retouching framework called PerTouch. Our method supports semantic-level image retouching while maintaining global aesthetics. Using parameter maps containing attribute values in specific semantic regions as input, PerTouch constructs an explicit parameter-to-image mapping for fine-grained image retouching. To improve semantic boundary perception, we introduce semantic replacement and parameter perturbation mechanisms during training. To connect natural language instructions with visual control, we develop a VLM-driven agent to handle both strong and weak user instructions. Equipped with mechanisms of feedback-driven rethinking and scene-aware memory, PerTouch…
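The abstract describes parameter maps that hold attribute values for specific semantic regions. A minimal sketch of that idea is below; it is an illustration under stated assumptions, not the authors' implementation. The function name `build_parameter_maps`, the region labels, the attribute names, and the value ranges are all hypothetical, and the segmentation mask is assumed to be given as a 2D grid of region labels.

```python
# Hypothetical sketch: turn per-region attribute values into per-pixel
# parameter maps, one map per attribute. Names and values are assumptions.

def build_parameter_maps(mask, region_params, attributes, default=0.0):
    """Return {attribute: 2D map} where each pixel holds its region's value.

    mask          -- 2D list of region labels (assumed segmentation output)
    region_params -- {label: {attribute: value}} requested adjustments
    attributes    -- attribute names to build maps for
    default       -- value used where a region has no setting (no change)
    """
    maps = {}
    for attr in attributes:
        maps[attr] = [
            [region_params.get(label, {}).get(attr, default) for label in row]
            for row in mask
        ]
    return maps


# Toy 3x3 "image" with three semantic regions.
mask = [
    ["sky", "sky", "sky"],
    ["sky", "person", "sky"],
    ["grass", "person", "grass"],
]

# Brighten and saturate the sky, darken the person, leave grass untouched.
region_params = {
    "sky": {"exposure": 0.25, "saturation": 0.5},
    "person": {"exposure": -0.25},
}

maps = build_parameter_maps(mask, region_params, ["exposure", "saturation"])
for name, grid in maps.items():
    print(name, grid)
```

In a setup like the one the abstract outlines, maps of this kind would be the conditioning input to the retouching model; how PerTouch encodes or normalizes them is not stated in the text above.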

Taxonomy

Topics: Visual Attention and Saliency Detection · Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis