Entity-Level Text-Guided Image Manipulation
Yikai Wang, Jianan Wang, Guansong Lu, Hang Xu, Zhenguo Li, Wei Zhang, and Yanwei Fu

TL;DR
This paper introduces SeMani, a novel framework for real-world, entity-level text-guided image manipulation that accurately edits and merges entities based on text descriptions while preserving irrelevant regions.
Contribution
SeMani is the first framework to perform entity-level text-guided image manipulation in real-world scenarios, combining semantic alignment with advanced generative models for precise editing.
Findings
SeMani outperforms baseline methods in accuracy and flexibility.
SeMani effectively distinguishes entity-relevant regions.
SeMani achieves zero-shot manipulation on real datasets.
Abstract
Existing text-guided image manipulation methods aim to modify the appearance of the image or to edit a few objects in a virtual or simple scenario, which is far from practical applications. In this work, we study a novel task: text-guided image manipulation on the entity level in the real world (eL-TGIM). The task imposes three basic requirements: (1) to edit the entity consistently with the text descriptions, (2) to preserve the entity-irrelevant regions, and (3) to merge the manipulated entity into the image naturally. To this end, we propose an elegant framework, dubbed SeMani, for the Semantic Manipulation of real-world images, which can not only edit the appearance of entities but also generate new entities corresponding to the text guidance. To solve eL-TGIM, SeMani decomposes the task into two phases: the semantic alignment phase and the image manipulation phase. In the…
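Requirement (2), preserving entity-irrelevant regions, can be illustrated with generic mask-based compositing. The sketch below is an assumption for illustration only and is not SeMani's actual mechanism, which the truncated abstract does not describe; the function name `composite`, the `Image` alias, and the toy pixel values are all hypothetical.

```python
from typing import List

# Hypothetical representation: a grayscale image as rows of pixel values in [0, 1].
Image = List[List[float]]


def composite(original: Image, edited: Image, mask: Image) -> Image:
    """Take edited pixels where mask is 1 and original pixels where mask is 0."""
    return [
        [m * e + (1.0 - m) * o for o, e, m in zip(orow, erow, mrow)]
        for orow, erow, mrow in zip(original, edited, mask)
    ]


# Toy 2x2 example: only the top-right pixel belongs to the entity being edited.
original = [[0.1, 0.2], [0.3, 0.4]]
edited = [[0.9, 0.9], [0.9, 0.9]]
mask = [[0.0, 1.0], [0.0, 0.0]]

result = composite(original, edited, mask)
```

With a binary mask, every pixel outside the entity region is copied unchanged from the original. Fractional mask values near the entity boundary would blend the two images, which is one simple way to approach requirement (3), merging the edited entity naturally.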
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Digital Media Forensic Detection · Multimodal Machine Learning Applications
Methods: Diffusion
