FlexEdit: Marrying Free-Shape Masks to VLLM for Flexible Image Editing

Tianshuo Yuan; Yuxiang Lin; Jue Wang; Zhi-Qi Cheng; Xiaolong Wang; Jiao GH; Wei Chen; Xiaojiang Peng

arXiv:2408.12429 · cs.CV · July 15, 2025

TL;DR

FlexEdit is an end-to-end image editing method that combines free-shape masks with language instructions, using a Vision Large Language Model (VLLM) and a Mask Enhance Adapter (MEA) to improve editing accuracy and ease of use.

Contribution

We propose FlexEdit, which effectively integrates free-shape masks with language instructions using a VLLM and a new Mask Enhance Adapter for improved image editing performance.

Findings

01 Achieves state-of-the-art results on LLM-based image editing tasks.

02 Introduces the FSMI-Edit benchmark, which covers 8 free-shape mask types.

03 Demonstrates the effectiveness of simple prompting techniques.

Abstract

Combining Vision Large Language Models (VLLMs) with diffusion models offers a powerful way to perform image editing from human language instructions. However, language instructions alone often fail to convey user requirements accurately, particularly when users want to add or replace elements in specific areas of an image. Masks can indicate the exact locations or elements to be edited, but they require users to draw precise shapes at the desired locations, which is highly user-unfriendly. To address this, we propose FlexEdit, an end-to-end image editing method that leverages both free-shape masks and language instructions for Flexible Editing. Our approach employs a VLLM to comprehend the image content, mask, and user instructions. Additionally, we introduce the Mask Enhance Adapter (MEA) that fuses the embeddings of the VLLM with…

Code & Models

Repositories

a-new-b/flex_edit (Official)

Taxonomy

Topics: Computer Graphics and Visualization Techniques · Augmented Reality Applications

Methods: Adapter · Diffusion