Camera Control for Text-to-Image Generation via Learning Viewpoint Tokens

Xinxuan Lu; Charless Fowlkes; Alexander C. Berg

arXiv:2604.19954·cs.CV·April 23, 2026

TL;DR

This paper introduces a method for precise camera control in text-to-image generation by learning parametric camera tokens, enabling better geometric understanding and transferability across object categories.

Contribution

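The paper presents a framework that learns viewpoint-conditioned camera tokens for precise camera control. Training data combines 3D-rendered images, which provide geometric supervision, with photorealistic augmentations, which add appearance and background diversity.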
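As a rough illustration of what a "parametric camera token" could look like, the sketch below maps two camera angles to a single embedding vector and appends it to a prompt's token embeddings. This is not the authors' implementation: the sinusoidal encoding, the linear projection, the choice of azimuth and elevation as the camera parameters, and every function name (`camera_features`, `make_projection`, `viewpoint_token`, `condition_prompt`) are assumptions made for illustration only. In a real system the projection weights would be learned during fine-tuning rather than drawn at random.

```python
import math
import random


def camera_features(azimuth_deg, elevation_deg, num_freqs=4):
    """Encode two camera angles as sin/cos features at several frequencies.

    Hypothetical encoding: the paper's actual parameterization is not
    described in this summary.
    """
    feats = []
    for angle_deg in (azimuth_deg, elevation_deg):
        angle = math.radians(angle_deg)
        for k in range(num_freqs):
            freq = 2 ** k
            feats.append(math.sin(freq * angle))
            feats.append(math.cos(freq * angle))
    return feats  # length = 2 angles * num_freqs * 2


def make_projection(in_dim, out_dim, seed=0):
    """Random stand-in for a projection matrix that would normally be learned."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(in_dim)] for _ in range(out_dim)]


def viewpoint_token(azimuth_deg, elevation_deg, weights, num_freqs=4):
    """Project the camera features to one embedding vector (the 'token')."""
    feats = camera_features(azimuth_deg, elevation_deg, num_freqs)
    return [sum(w * f for w, f in zip(row, feats)) for row in weights]


def condition_prompt(text_embeddings, token):
    """Return a new sequence with the viewpoint token appended to the prompt."""
    return text_embeddings + [token]


# Usage: a 3-token prompt with 8-dimensional embeddings (placeholder values).
embed_dim = 8
weights = make_projection(in_dim=16, out_dim=embed_dim)
token = viewpoint_token(azimuth_deg=30.0, elevation_deg=15.0, weights=weights)
prompt = [[0.0] * embed_dim for _ in range(3)]
conditioned = condition_prompt(prompt, token)
```

Because the token here depends only on the camera parameters and not on the prompt's content, the same token can in principle be attached to prompts about any object, which is the kind of factorization the summary credits for transfer to unseen categories.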

Findings

01. Achieves state-of-the-art accuracy in camera control
02. Preserves image quality and prompt fidelity
03. Tokens transfer to unseen object categories

Abstract

Current text-to-image models struggle to provide precise camera control using natural language alone. In this work, we present a framework for precise camera control with global scene understanding in text-to-image generation by learning parametric camera tokens. We fine-tune image generation models for viewpoint-conditioned text-to-image generation on a curated dataset that combines 3D-rendered images for geometric supervision and photorealistic augmentations for appearance and background diversity. Qualitative and quantitative experiments demonstrate that our method achieves state-of-the-art accuracy while preserving image quality and prompt fidelity. Unlike prior methods that overfit to object-specific appearance correlations, our viewpoint tokens learn factorized geometric representations that transfer to unseen object categories. Our work shows that text-vision latent spaces can be…

Code & Models

Repositories

https://randdl.github.io/viewtoken_control