Multitwine: Multi-Object Compositing with Text and Layout Control
Gemma Canet Tarrés, Zhe Lin, Zhifei Zhang, He Zhang, Andrew Gilbert, John Collomosse, Soo Ye Kim

TL;DR
Multitwine is a novel generative model that enables multi-object scene compositing guided by text and layout, supporting complex interactions and autonomous prop generation, with state-of-the-art performance.
Contribution
It introduces the first model for multi-object compositing with text and layout control, combining compositing and subject-driven generation in a unified framework.
Findings
Achieves state-of-the-art results in multi-object compositing.
Supports complex interactions and autonomous prop generation.
Uses a new data synthesis pipeline for training.
Abstract
We introduce the first generative model capable of simultaneous multi-object compositing, guided by both text and layout. Our model allows for the addition of multiple objects within a scene, capturing a range of interactions from simple positional relations (e.g., next to, in front of) to complex actions requiring reposing (e.g., hugging, playing guitar). When an interaction implies additional props, like 'taking a selfie', our model autonomously generates these supporting objects. By jointly training for compositing and subject-driven generation, also known as customization, we achieve a more balanced integration of textual and visual inputs for text-driven object compositing. As a result, we obtain a versatile model with state-of-the-art performance in both tasks. We further present a data generation pipeline leveraging visual and language models to effortlessly synthesize…
Taxonomy
Topics: Multimedia Communication and Technology
