TL;DR
This paper introduces a GAN-based approach that treats text as neural operators to enable complex, multi-object image editing guided by natural language instructions, improving fidelity and semantic relevance.
Contribution
It proposes a novel method that uses text as neural operators for local image feature modification, advancing multimodal image manipulation techniques.
Findings
Outperforms recent baselines on three datasets
Generates images with higher fidelity and semantic relevance
Enhances image retrieval performance
Abstract
In recent years, text-guided image manipulation has gained increasing attention in the multimedia and computer vision community. The input to conditional image generation has evolved from image-only to multimodal. In this paper, we study a setting that allows users to edit an image with multiple objects using complex text instructions to add, remove, or change the objects. The inputs of the task are multimodal, including (1) a reference image and (2) an instruction in natural language that describes the desired modifications to the image. We propose a GAN-based method to tackle this problem. The key idea is to treat text as neural operators that locally modify the image feature. We show that the proposed model performs favorably against recent strong baselines on three public datasets. Specifically, it generates images of greater fidelity and semantic relevance, and when used as an image…
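The abstract states the key idea only at a high level: text acts as an operator that locally modifies the image feature. The sketch below is one possible reading of that idea, not the paper's architecture. It assumes the text embedding is used twice: once to produce a spatial mask ("where" to edit) and once to produce a channel-mixing matrix ("how" to edit). The function name `text_operator_edit`, the projection matrices `w_where` and `w_how`, and all shapes are hypothetical and chosen for illustration only.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def text_operator_edit(feat, text_emb, w_where, w_how):
    """Illustrative text-conditioned local edit of an image feature map.

    feat:     (C, H, W) image feature map
    text_emb: (D,) instruction embedding
    w_where:  (C, D) hypothetical projection giving a per-channel query
    w_how:    (C*C, D) hypothetical projection giving a C x C operator
    """
    c, h, w = feat.shape

    # "Where": score each spatial location against a text-derived query,
    # then squash to (0, 1) to obtain a soft edit mask of shape (H, W).
    query = w_where @ text_emb
    mask = sigmoid(np.einsum("c,chw->hw", query, feat) / np.sqrt(c))

    # "How": build a channel-mixing matrix from the text and apply it
    # at every location to get a candidate change of shape (C, H, W).
    op = (w_how @ text_emb).reshape(c, c)
    delta = np.einsum("dc,chw->dhw", op, feat)

    # Local modification: add the change only where the mask is active.
    return feat + mask[None, :, :] * delta, mask


# Toy usage with random values (assumed sizes: C=4, H=W=3, D=5).
rng = np.random.default_rng(0)
feat = rng.standard_normal((4, 3, 3))
w_where = rng.standard_normal((4, 5))
w_how = rng.standard_normal((16, 5))
edited, mask = text_operator_edit(feat, rng.standard_normal(5), w_where, w_how)
print(edited.shape, mask.shape)
```

In this sketch an all-zero instruction embedding yields a zero operator, so the feature map passes through unchanged. In a GAN setting, the edited feature would then be decoded by a generator; that part is outside the scope of this example.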
