CT-1: Vision-Language-Camera Models Transfer Spatial Reasoning Knowledge to Camera-Controllable Video Generation

Haoyu Zhao; Zihao Zhang; Jiaxi Gu; Haoran Chen; Qingping Zheng; Pin Tang; Yeyin Jin; Yuang Zhang; Junqi Cheng; Zenghui Lu; Peng Shu; Zuxuan Wu; Yu-Gang Jiang

arXiv:2604.09201 · cs.CV · April 13, 2026

TL;DR

This paper introduces CT-1, a vision-language-camera model that estimates camera trajectories in order to transfer spatial reasoning knowledge to camera-controllable video generation. It relies on a large-scale dataset (CT-200K) and a frequency-domain, wavelet-based regularization loss.

Contribution

The paper presents a new model, CT-1, which effectively transfers spatial reasoning to video generation, and introduces CT-200K, a large-scale dataset for training such models.

Findings

01. Improves camera control accuracy by 25.7% over previous methods.

02. Generates high-quality, spatially aware camera-controllable videos.

03. Employs a Wavelet-based Regularization Loss to learn complex camera trajectories.

Abstract

Camera-controllable video generation aims to synthesize videos with flexible and physically plausible camera movements. However, existing methods either provide imprecise camera control from text prompts or rely on labor-intensive manual camera trajectory parameters, limiting their use in automated scenarios. To address these issues, we propose a novel Vision-Language-Camera model, termed CT-1 (Camera Transformer 1), a specialized model designed to transfer spatial reasoning knowledge to video generation by accurately estimating camera trajectories. Built upon vision-language modules and a Diffusion Transformer model, CT-1 employs a Wavelet-based Regularization Loss in the frequency domain to effectively learn complex camera trajectory distributions. These trajectories are integrated into a video diffusion model to enable spatially aware camera control that aligns with user intentions.…
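
The abstract names a Wavelet-based Regularization Loss applied in the frequency domain but does not give its form here. As a rough illustration only, the sketch below compares a predicted and a target camera trajectory in a multi-level Haar wavelet basis and sums the mean absolute differences per sub-band. The Haar basis, the L1 distance, the equal sub-band weighting, the `(T, D)` trajectory layout, and all function names are assumptions made for this example, not the formulation used by CT-1.

```python
import numpy as np


def haar_dwt(x):
    """One level of an orthonormal Haar transform along the time axis.

    x: array of shape (T, D) with even T.
    Returns (approx, detail), each of shape (T // 2, D).
    """
    even, odd = x[0::2], x[1::2]
    approx = (even + odd) / np.sqrt(2.0)  # low-frequency (smooth motion)
    detail = (even - odd) / np.sqrt(2.0)  # high-frequency (jitter, sharp turns)
    return approx, detail


def wavelet_coeffs(x, levels):
    """Multi-level Haar decomposition: [detail_1, ..., detail_L, approx_L]."""
    approx = np.asarray(x, dtype=np.float64)
    coeffs = []
    for _ in range(levels):
        if approx.shape[0] % 2 != 0:
            raise ValueError("sequence length must be divisible by 2**levels")
        approx, detail = haar_dwt(approx)
        coeffs.append(detail)
    coeffs.append(approx)
    return coeffs


def wavelet_regularization_loss(pred, target, levels=2):
    """Hypothetical loss: sum over sub-bands of the mean absolute
    difference between predicted and target wavelet coefficients."""
    pred_bands = wavelet_coeffs(pred, levels)
    target_bands = wavelet_coeffs(target, levels)
    return sum(np.mean(np.abs(p - t)) for p, t in zip(pred_bands, target_bands))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Assumed layout: 16 frames, 6 pose parameters per frame.
    target = rng.normal(size=(16, 6))
    pred = target + 0.05 * rng.normal(size=(16, 6))
    print(wavelet_regularization_loss(pred, target))
```

Splitting the comparison by sub-band means errors in slow, global camera motion and errors in fast, local motion contribute separate terms, which is one plausible reason to regularize trajectories in a wavelet domain rather than only frame by frame.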

Code & Models

Repositories

gulucaptain/Camera-Transformer-1 (GitHub)