DiffusionCinema: Text-to-Aerial Cinematography

Valerii Serpiva; Artem Lykov; Jeffrin Sam; Aleksey Fedoseev; Dzmitry Tsetserukou

arXiv:2601.17412 · cs.RO · January 27, 2026

TL;DR

DiffusionCinema introduces a UAV system that interprets natural language prompts to autonomously generate and execute cinematic flight trajectories, simplifying aerial filming and reducing user workload.

Contribution

The paper presents a novel diffusion model-based approach for autonomous UAV cinematography driven by natural language descriptions, enabling intuitive, text-based control of drone flight paths.

Findings

01. Lower user workload with the system compared to traditional remote control

02. Significant reduction in mental demand and frustration levels

03. Successful generation of cinematic flight trajectories from natural language prompts
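
According to the abstract, the workload findings above rest on NASA-TLX ratings. As a rough illustration of how such a score can be computed, the sketch below uses the unweighted "Raw TLX" variant, which averages the six standard subscales on a 0 to 100 scale. Whether the authors used the raw or the pairwise-weighted variant is not stated here, so treat that choice as an assumption. The function name `raw_tlx` and the example ratings are hypothetical and are not data from the paper.

```python
# Minimal sketch of a Raw TLX (unweighted NASA-TLX) score.
# Assumption: raw variant; the paper may have used weighted scoring instead.

TLX_SUBSCALES = (
    "mental_demand",
    "physical_demand",
    "temporal_demand",
    "performance",
    "effort",
    "frustration",
)


def raw_tlx(ratings):
    """Return the mean of the six subscale ratings (each expected in 0..100)."""
    missing = [name for name in TLX_SUBSCALES if name not in ratings]
    if missing:
        raise ValueError(f"missing subscales: {missing}")
    for name in TLX_SUBSCALES:
        if not 0 <= ratings[name] <= 100:
            raise ValueError(f"{name} must be in 0..100, got {ratings[name]}")
    return sum(ratings[name] for name in TLX_SUBSCALES) / len(TLX_SUBSCALES)


# Made-up ratings for one hypothetical participant (not from the paper).
example_ratings = {
    "mental_demand": 20,
    "physical_demand": 10,
    "temporal_demand": 25,
    "performance": 15,
    "effort": 30,
    "frustration": 20,
}
example_score = raw_tlx(example_ratings)
```

A per-condition mean such as the M value quoted in the abstract would then be the average of these per-participant scores within one interface condition.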

Abstract

We propose a novel Unmanned Aerial Vehicle (UAV)-assisted creative capture system that leverages diffusion models to interpret high-level natural language prompts and automatically generate optimal flight trajectories for cinematic video recording. Instead of manually piloting the drone, the user simply describes the desired shot (e.g., "orbit around me slowly from the right and reveal the background waterfall"). Our system encodes the prompt along with an initial visual snapshot from the onboard camera, and a diffusion model samples plausible spatio-temporal motion plans that satisfy both the scene geometry and the shot semantics. The generated flight trajectory is then executed autonomously by the UAV to record smooth, repeatable video clips that match the prompt. User evaluation using NASA-TLX showed a significantly lower overall workload with our interface (M = 21.6) compared to a…
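
To make the sampling step in the abstract more concrete, here is a minimal sketch of DDPM-style ancestral sampling over a flattened list of 2D waypoints. It is not the authors' model. In particular, `oracle_denoiser` is a hypothetical stand-in for a trained, prompt- and image-conditioned network: it already knows the target trajectory, which a real system would have to infer from the text and the camera snapshot. The helper `orbit_waypoints`, the linear noise schedule, and the mapping of "from the right" to a clockwise orbit are likewise illustrative assumptions.

```python
import math
import random


def orbit_waypoints(radius, n, clockwise=True):
    """Hypothetical target shot: n (x, y) waypoints on a circle around the subject."""
    sign = -1.0 if clockwise else 1.0
    points = []
    for k in range(n):
        angle = sign * 2.0 * math.pi * k / n
        points += [radius * math.cos(angle), radius * math.sin(angle)]
    return points  # flattened as [x0, y0, x1, y1, ...]


def make_schedule(steps, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule plus cumulative products of alpha = 1 - beta (steps >= 2)."""
    betas = [beta_start + (beta_end - beta_start) * t / (steps - 1) for t in range(steps)]
    alphas = [1.0 - b for b in betas]
    alpha_bars, running = [], 1.0
    for a in alphas:
        running *= a
        alpha_bars.append(running)
    return betas, alphas, alpha_bars


def oracle_denoiser(x_t, t, target, alpha_bars):
    """Stand-in for a learned noise predictor (ASSUMPTION: it knows the target)."""
    signal = math.sqrt(alpha_bars[t])
    noise = math.sqrt(1.0 - alpha_bars[t])
    return [(x - signal * x0) / noise for x, x0 in zip(x_t, target)]


def sample_trajectory(target, steps=50, seed=0):
    """DDPM ancestral sampling: start from Gaussian noise and denoise step by step."""
    rng = random.Random(seed)
    betas, alphas, alpha_bars = make_schedule(steps)
    x = [rng.gauss(0.0, 1.0) for _ in target]
    for t in reversed(range(steps)):
        eps = oracle_denoiser(x, t, target, alpha_bars)
        coef = betas[t] / math.sqrt(1.0 - alpha_bars[t])
        mean = [(xi - coef * e) / math.sqrt(alphas[t]) for xi, e in zip(x, eps)]
        if t > 0:
            sigma = math.sqrt(betas[t])  # simple variance choice: sigma_t^2 = beta_t
            x = [m + sigma * rng.gauss(0.0, 1.0) for m in mean]
        else:
            x = mean  # no noise is added at the final step
    return x


target = orbit_waypoints(radius=2.0, n=8, clockwise=True)
trajectory = sample_trajectory(target, steps=50, seed=0)
```

Because the stand-in denoiser is exact for a single known target, the sampled waypoints collapse onto the orbit at the last step. With a learned denoiser, repeated sampling would instead yield a distribution of plausible trajectories for the same prompt.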

Taxonomy

Topics: Human Motion and Animation · Social Robot Interaction and HRI · Multimodal Machine Learning Applications