Composing Concepts from Images and Videos via Concept-prompt Binding
Xianghao Kong, Zeyu Zhang, Yuwei Guo, Zhuoran Zhao, Songchun Zhang, Anyi Rao

TL;DR
This paper introduces Bind & Compose, a novel one-shot method for flexible visual concept composition from images and videos, utilizing hierarchical binding, diversification, and temporal disentanglement to improve accuracy and coherence.
Contribution
The paper proposes a hierarchical binder structure, together with Diversify-and-Absorb and Temporal Disentanglement strategies, to improve concept binding and composition from image and video inputs.
Findings
Achieves superior concept consistency and prompt fidelity.
Improves motion quality in video concept composition.
Outperforms existing methods in visual concept integration.
Abstract
Visual concept composition, which aims to integrate different elements from images and videos into a single, coherent visual output, still falls short in accurately extracting complex concepts from visual inputs and flexibly combining concepts from both images and videos. We introduce Bind & Compose, a one-shot method that enables flexible visual concept composition by binding visual concepts with corresponding prompt tokens and composing the target prompt with bound tokens from various sources. It adopts a hierarchical binder structure for cross-attention conditioning in Diffusion Transformers to encode visual concepts into corresponding prompt tokens for accurate decomposition of complex visual concepts. To improve concept-token binding accuracy, we design a Diversify-and-Absorb Mechanism that uses an extra absorbent token to eliminate the impact of concept-irrelevant details when…
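The composition step described in the abstract (building a target prompt out of bound tokens taken from several sources) can be illustrated with a toy sketch. This is not the authors' implementation: the function `compose_prompt`, the dictionary layout, and the 2-D embeddings are all hypothetical, and the hierarchical binder that would actually produce the bound embeddings is not modeled here. The sketch only shows the data flow of swapping bound embeddings in for the matching prompt tokens.

```python
from typing import Dict, List, Sequence

Embedding = List[float]


def compose_prompt(
    tokens: Sequence[str],
    base_embeddings: Dict[str, Embedding],
    bound_embeddings: Dict[str, Embedding],
) -> List[Embedding]:
    """Return one embedding per prompt token.

    A token that has a bound (concept-carrying) embedding uses it;
    every other token falls back to its ordinary text embedding.
    """
    composed = []
    for tok in tokens:
        if tok in bound_embeddings:
            composed.append(list(bound_embeddings[tok]))
        else:
            composed.append(list(base_embeddings[tok]))
    return composed


# Toy 2-D embeddings standing in for a text encoder's output (assumed values).
base = {"a": [0.0, 0.0], "dog": [1.0, 0.0], "dancing": [0.0, 1.0]}

# Hypothetical bound tokens: appearance bound from an image, motion from a video.
bound_from_image = {"dog": [0.9, 0.3]}
bound_from_video = {"dancing": [0.2, 1.1]}

# Merge the bound tokens from both sources, then compose the target prompt.
bound = {**bound_from_image, **bound_from_video}
composed = compose_prompt(["a", "dog", "dancing"], base, bound)
```

In this sketch the composed sequence would take the place of the plain text embeddings as the conditioning input; how the paper's binder produces and injects these tokens through cross-attention is not specified in the excerpt above.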
Taxonomy
Topics: Visual Attention and Saliency Detection · Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications
