MOSAIC: Multi-Object Segmented Arbitrary Stylization Using CLIP
Prajwal Ganugula, Y S S S Santosh Kumar, N K Sagar Reddy, Prabhath Chellingi, Avinash Thakur, Neeraj Kasera, C Shyam Anand

TL;DR
MOSAIC introduces a novel text-guided method for object-wise image stylization, enabling fine control over individual objects' styles based on context, surpassing previous methods in quality and flexibility.
Contribution
It is the first approach to achieve text-guided, arbitrary object-wise stylization using vision transformer-based segmentation and stylization modules.
Findings
Produces high-quality, visually appealing stylized images.
Enhances control over stylization of individual objects.
Generalizes well to unseen object classes.
Abstract
Style transfer driven by text prompts has opened a new path for creatively stylizing images without collecting an actual style image. Despite promising results, text-driven stylization gives the user no control over the stylization. A user who wants to create an artistic image needs fine control over the stylization of individual entities in the content image, which current state-of-the-art approaches do not address. Diffusion-based style transfer methods suffer from the same issue, because their regional control over the stylized output is ineffective. To address this problem, we propose a new method, Multi-Object Segmented Arbitrary Stylization Using CLIP (MOSAIC), that can apply styles to different objects in the image based on the context extracted from the input prompt. Text-based segmentation and stylization…
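One way to picture object-wise stylization is as mask-based compositing. The following is a minimal sketch under stated assumptions, not MOSAIC's actual pipeline: it assumes a text-based segmentation step has already produced one soft mask per object named in the prompt, and a stylization step has produced one stylized image per object. The function name `composite_object_styles`, the dictionary layout, and the toy arrays are all hypothetical.

```python
import numpy as np

def composite_object_styles(content, stylized_by_object, masks_by_object):
    """Blend per-object stylized images into the content image.

    content:            H x W x 3 array, the original image.
    stylized_by_object: dict name -> H x W x 3 array (hypothetical stylizer output).
    masks_by_object:    dict name -> H x W array in [0, 1] (hypothetical segmenter output).
    """
    out = content.astype(np.float32).copy()
    for name, stylized in stylized_by_object.items():
        # Add a channel axis so the mask broadcasts over RGB.
        mask = masks_by_object[name].astype(np.float32)[..., None]
        # Inside the mask take the stylized pixels; elsewhere keep the current output.
        out = mask * stylized.astype(np.float32) + (1.0 - mask) * out
    return out

# Toy 2x2 example: "sky" covers the top row, "car" covers the bottom-left pixel.
content = np.zeros((2, 2, 3), dtype=np.float32)
stylized = {
    "sky": np.ones((2, 2, 3), dtype=np.float32),
    "car": np.full((2, 2, 3), 0.5, dtype=np.float32),
}
masks = {
    "sky": np.array([[1.0, 1.0], [0.0, 0.0]]),
    "car": np.array([[0.0, 0.0], [1.0, 0.0]]),
}
result = composite_object_styles(content, stylized, masks)
```

Pixels outside every mask keep the original content, which is what lets each object receive its own style without affecting the rest of the image.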
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Human Motion and Animation · Image Processing and 3D Reconstruction
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Softmax · Residual Connection · Layer Normalization · Dense Connections · Vision Transformer · Contrastive Language-Image Pre-training · Diffusion
