Control-DINO: Feature Space Conditioning for Controllable Image-to-Video Diffusion
Edoardo A. Dominici, Thomas Deixelberger, Konstantinos Vardis, Markus Steinberger

TL;DR
Control-DINO introduces a method to use self-supervised image features for controllable video generation, enabling tasks like stylization and relighting with a lightweight architecture.
Contribution
The paper presents an approach that leverages DINO features to control pretrained video diffusion models, decoupling appearance from other scene attributes.
Findings
The method demonstrates effective video domain transfer.
Controllability improves with higher feature dimensionality.
Decoupling appearance enables robust stylization and relighting.
Abstract
Video models have recently been applied with success to problems in content generation, novel view synthesis, and, more broadly, world simulation. Many applications in generation and transfer rely on conditioning these models, typically through perceptual, geometric, or simple semantic signals, fundamentally using them as generative renderers. At the same time, high-dimensional features obtained from large-scale self-supervised learning on images or point clouds are increasingly used as a general-purpose interface for vision models. The connection between the two has been explored for subject-specific editing and for aligning and training video diffusion models, but not as a more general conditioning signal for pretrained video diffusion models. Features obtained through self-supervised learning, such as DINO features, contain a lot of entangled information about style, lighting, and semantics…
