TL;DR
This paper introduces a method for precise camera control in text-to-image generation by learning parametric camera tokens. The tokens capture viewpoint geometry separately from object appearance, so they transfer to unseen object categories.
Contribution
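The paper presents a framework that learns viewpoint-conditioned tokens for camera control by fine-tuning image generation models on a curated dataset. The dataset combines 3D-rendered images, which provide geometric supervision, with photorealistic augmentations, which add appearance and background diversity.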
Findings
Achieves state-of-the-art accuracy in camera control
Preserves image quality and prompt fidelity
Tokens transfer to unseen object categories
Abstract
Current text-to-image models struggle to provide precise camera control using natural language alone. In this work, we present a framework for precise camera control with global scene understanding in text-to-image generation by learning parametric camera tokens. We fine-tune image generation models for viewpoint-conditioned text-to-image generation on a curated dataset that combines 3D-rendered images for geometric supervision and photorealistic augmentations for appearance and background diversity. Qualitative and quantitative experiments demonstrate that our method achieves state-of-the-art accuracy while preserving image quality and prompt fidelity. Unlike prior methods that overfit to object-specific appearance correlations, our viewpoint tokens learn factorized geometric representations that transfer to unseen object categories. Our work shows that text-vision latent spaces can be…
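To make the idea of a parametric camera token concrete, here is a minimal sketch under stated assumptions. The abstract does not say how the camera is parameterized or how the token is computed, so everything here is hypothetical: the (azimuth, elevation, distance) parameterization, the sinusoidal features, the linear projection, and the names `camera_features` and `CameraTokenizer`. The random weights are stand-ins for values that would be learned during fine-tuning.

```python
import math
import random


def camera_features(azimuth_deg, elevation_deg, distance, num_freqs=4):
    """Sinusoidal features for a camera pose (hypothetical parameterization)."""
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    feats = []
    for k in range(num_freqs):
        freq = 2 ** k
        # sin/cos pairs keep the encoding periodic in each angle.
        for angle in (az, el):
            feats.append(math.sin(freq * angle))
            feats.append(math.cos(freq * angle))
    feats.append(distance)  # distance is passed through unchanged
    return feats


class CameraTokenizer:
    """Linear map from camera features to one token embedding.

    Assumption: in a real system these weights would be trained jointly
    with the image generation model; here they are random placeholders.
    """

    def __init__(self, embed_dim=8, num_freqs=4, seed=0):
        rng = random.Random(seed)
        self.num_freqs = num_freqs
        in_dim = 4 * num_freqs + 1
        self.weight = [
            [rng.gauss(0.0, 0.02) for _ in range(in_dim)]
            for _ in range(embed_dim)
        ]
        self.bias = [0.0] * embed_dim

    def __call__(self, azimuth_deg, elevation_deg, distance):
        feats = camera_features(
            azimuth_deg, elevation_deg, distance, self.num_freqs
        )
        return [
            sum(w * f for w, f in zip(row, feats)) + b
            for row, b in zip(self.weight, self.bias)
        ]


tokenizer = CameraTokenizer()
token = tokenizer(azimuth_deg=45.0, elevation_deg=30.0, distance=1.5)
```

In a setup like this, the resulting vector could be placed alongside the text-encoder embeddings of the prompt, so that the viewpoint is specified by continuous parameters rather than by words. Whether the paper's tokens are built this way is not stated in the abstract.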
