Text-Driven Stylization of Video Objects
Sebastian Loeschcke, Serge Belongie, Sagie Benaim

TL;DR
This paper introduces a method for semantically stylizing video objects based on user text prompts, ensuring temporal consistency and preservation of details by leveraging CLIP and an atlas decomposition network.
Contribution
The novel approach combines global and local text prompts with CLIP similarity and an atlas network to achieve temporally consistent, detailed, and user-guided video object stylization.
Findings
Produces consistent style changes over time
Adheres to user-specified text prompts
Allows varying levels of stylization detail
Abstract
We tackle the task of stylizing video objects in an intuitive and semantic manner following a user-specified text prompt. This is a challenging task as the resulting video must satisfy multiple properties: (1) it has to be temporally consistent and avoid jittering or similar artifacts, (2) the resulting stylization must preserve both the global semantics of the object and its fine-grained details, and (3) it must adhere to the user-specified text prompt. To this end, our method stylizes an object in a video according to two target texts. The first target text prompt describes the global semantics and the second target text prompt describes the local semantics. To modify the style of an object, we harness the representational power of CLIP to get a similarity score between (1) the local target text and a set of local stylized views, and (2) a global target text and a set of stylized…
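The two similarity scores described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the embeddings are plain lists standing in for CLIP text and image features, and the function names, the `1 - similarity` form of the loss, and the two weights are all assumptions made for illustration. The abstract is cut off before it describes the second set of views, so `global_views` is only a placeholder name for that set.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def mean_similarity(text_embedding, view_embeddings):
    """Average similarity between one text embedding and a set of view embeddings."""
    scores = [cosine_similarity(text_embedding, v) for v in view_embeddings]
    return sum(scores) / len(scores)

def stylization_loss(local_text, local_views, global_text, global_views,
                     local_weight=1.0, global_weight=1.0):
    """Hypothetical combined objective: lower when views match their target text.

    The weights and the (1 - similarity) form are assumptions, not taken
    from the paper.
    """
    local_term = 1.0 - mean_similarity(local_text, local_views)
    global_term = 1.0 - mean_similarity(global_text, global_views)
    return local_weight * local_term + global_weight * global_term
```

In a real pipeline the vectors would come from a CLIP text encoder and image encoder applied to the prompts and to rendered views of the stylized object; here any numeric vectors of matching length will do.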
Taxonomy
Topics: Natural Language Processing Techniques · Digital Humanities and Scholarship · Handwritten Text Recognition Techniques
Methods: Contrastive Language-Image Pre-training
