Towards a Robust Aerial Cinematography Platform: Localizing and Tracking Moving Targets in Unstructured Environments
Rogerio Bonatti, Cherie Ho, Wenshan Wang, Sanjiban Choudhury, Sebastian Scherer

TL;DR
This paper presents an autonomous aerial cinematography system that localizes and tracks moving targets in unstructured environments using vision-based methods, real-time mapping, and optimized camera planning, outperforming existing approaches.
Contribution
It introduces a complete system combining vision-based target localization, real-time 3D mapping, and an advanced camera planner for autonomous drone cinematography in unstructured settings.
Findings
System achieves state-of-the-art performance in robustness and real-time tracking.
Successfully operates in unknown, unstructured environments without prior maps.
Demonstrates effective target localization and smooth camera control in dynamic scenarios.
Abstract
The use of drones for aerial cinematography has revolutionized several applications and industries that require live and dynamic camera viewpoints such as entertainment, sports, and security. However, safely controlling a drone while filming a moving target usually requires multiple expert human operators; hence the need for an autonomous cinematographer. Current approaches have severe real-life limitations such as requiring fully scripted scenes, high-precision motion-capture systems or GPS tags to localize targets, and prior maps of the environment to avoid obstacles and plan for occlusion. In this work, we overcome such limitations and propose a complete system for aerial cinematography that combines: (1) a vision-based algorithm for target localization; (2) a real-time incremental 3D signed-distance map algorithm for occlusion and safety computation; and (3) a real-time camera…
| System | Module | CPU | RAM (MB) | Runtime (ms) |
|---|---|---|---|---|
| Vision | Detection | 57 | 2160 | 145 |
| Vision | Tracking | 24 | 25 | 14.4 |
| Vision | Heading | 24 | 1768 | 13.9 |
| Vision | KF | 8 | 80 | 0.207 |
| Mapping | Grid | 22 | 48 | 36.8 |
| Mapping | TSDF | 91 | 810 | 100-6000 |
| Mapping | LiDAR | 24 | 9 | NA |
| Planning | Planner | 98 | 789 | 198 |
| | DJI SDK | 89 | 40 | NA |

| Planning condition | Avg. plan | Avg. cost | Median cost |
|---|---|---|---|
| Ground-truth map | 32.1 | 0.1022 | 0.0603 |
| Online map | 69.0 | 0.1102 | 0.0825 |
| Ground-truth actor | 36.5 | 0.0539 | 0.0475 |
| Noise in actor | 30.2 | 0.1276 | 0.0953 |
Towards a Robust Aerial Cinematography Platform:
Localizing and Tracking Moving Targets in Unstructured Environments
Rogerio Bonatti 1, Cherie Ho 1, Wenshan Wang 1, Sanjiban Choudhury 2 and Sebastian Scherer 1
Research presented in this paper was funded by Yamaha Motor Co., Ltd.
1 R. Bonatti, C. Ho, W. Wang, S. Scherer belong to The Robotics Institute, School of Computer Science, Carnegie Mellon University, Pittsburgh PA {rbonatti,cherieh,wenshanw,basti}@cs.cmu.edu
2 S. Choudhury is with the School of Computer Science and Engineering of the University of Washington, Seattle, WA [email protected]
Abstract
The use of drones for aerial cinematography has revolutionized several applications and industries that require live and dynamic camera viewpoints such as entertainment, sports, and security. However, safely controlling a drone while filming a moving target usually requires multiple expert human operators; hence the need for an autonomous cinematographer. Current approaches have severe real-life limitations such as requiring fully scripted scenes, high-precision motion-capture systems or GPS tags to localize targets, and prior maps of the environment to avoid obstacles and plan for occlusion.
In this work, we overcome such limitations and propose a complete system for aerial cinematography that combines: (1) a vision-based algorithm for target localization; (2) a real-time incremental 3D signed-distance map algorithm for occlusion and safety computation; and (3) a real-time camera motion planner that optimizes smoothness, collisions, occlusions and artistic guidelines. We evaluate robustness and real-time performance in a series of field experiments and simulations by tracking dynamic targets moving through unknown, unstructured environments. Finally, we verify that despite removing previous limitations, our system achieves state-of-the-art performance.
I Introduction
In this paper, we address the problem of autonomous cinematography using unmanned aerial vehicles (UAVs). Specifically, we focus on scenarios where a UAV must film an actor moving through an unknown environment at high speeds, in an unscripted manner. Filming dynamic actors among clutter is extremely challenging, even for experienced pilots. It takes close attention and effort to simultaneously predict how the scene is going to evolve, control the UAV, avoid obstacles and reach desired viewpoints. Towards solving this problem, we present a complete system that can autonomously handle the real-life constraints involved in aerial cinematography: tracking the actor, mapping out the surrounding terrain and planning maneuvers to capture high-quality, artistic shots.
Consider the typical filming scenario in Fig. 1. The UAV must accomplish a number of tasks. First, it must estimate the actor’s pose using an onboard camera and forecast their future motion. The pose estimation should be robust to changing viewpoints, backgrounds and lighting conditions. Accurate forecasting is key for anticipating events which require changing camera viewpoints. Secondly, the UAV must remain safe as it flies through new environments. Safety requires explicit modelling of environmental uncertainty. Finally, the UAV must capture high-quality videos, which requires maximizing a set of artistic guidelines. The key challenge is that all these tasks must be done in real-time under limited onboard computational resources.
There is a rich history of work in autonomous aerial filming that tackles parts of these challenges. For instance, several works focus on artistic guidelines [1, 2, 3, 4] but often rely on perfect actor localization through high-precision RTK GNSS or motion-capture systems. Additionally, while the majority of work in the area deals with collisions between UAV and actors [2, 1, 5], the environment is not factored in. While there are several successful commercial products, they too are limited to either low-speed and low-clutter regimes (e.g. DJI Mavic [6]) or shorter planning horizons (e.g. Skydio R1 [7]). Even our previous work [8], despite handling environmental occlusions and collisions, assumes a prior elevation map and uses GPS to localize the actor. Such simplifications impose restrictions on the diversity of real-life scenarios that these systems can handle.
We address these challenges by building upon previous work that formulates the problem as an efficient real-time trajectory optimization [8]. In this work we make a key observation: we don’t need prior ground-truth information about the scene; our onboard sensors suffice to attain good performance. However, sensor data is noisy and needs to be processed in real-time; therefore we develop robust and efficient algorithms. To localize the actor, we use a visual tracking system. To map the environment, we use a long-range LiDAR and process it incrementally to build a signed distance field of the environment. Combining both methods, we can plan over long horizons in unknown environments to film fast dynamic actors according to artistic guidelines. In summary, our main contributions in this paper are threefold:
1. We develop an incremental signed distance transform algorithm for large-scale real-time environment mapping (Section IV-B);
2. We develop a complete system for autonomous cinematography that includes visual actor localization, online mapping, and efficient trajectory optimization that can deal with noisy measurements (Section IV);
3. We offer extensive quantitative and qualitative performance evaluations of our system both in simulation and field tests, while also comparing performance changes with scenarios with full map and actor knowledge (Section V).
II Problem Formulation
The overall task is to control a UAV to film an actor who is moving through an unknown environment. We formulate this as a trajectory optimization problem where the cost function measures shot quality, environmental occlusion of the actor, jerkiness of motion and safety. This cost function depends on the environment and the actor, both of which must be sensed on-the-fly. The changing nature of environment and actor trajectory also demands re-planning at a high frequency.
Let the trajectory of the UAV and the trajectory of the actor each be a function of time. The state of the actor, as sensed by onboard cameras, is fed into a prediction module that computes the actor's forecasted trajectory (Section IV-A).
Let the grid be a voxel occupancy grid that maps every point in space to a probability of occupancy, and let the signed distance map give the signed distance from a point to the nearest obstacle. Positive sign is for points in free space, and negative sign is for points either in occupied or unknown space, which we assume to be potentially inside an obstacle. The UAV senses the environment with the onboard LiDAR, updates the occupancy grid, and then updates the signed distance map (Section IV-B).
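The sign convention above can be illustrated with a small sketch. This is not the incremental algorithm of Section IV-B: it is a brute-force 2D computation, and the function name `signed_distance_field`, the `unknown` mask, and the `free_threshold` value of 0.5 are all assumptions made for illustration. Distances are in cell units.

```python
import math


def signed_distance_field(occupancy, unknown=None, free_threshold=0.5):
    """Brute-force 2D signed distance field, in cell units.

    occupancy: list of rows of occupancy probabilities in [0, 1].
    unknown:   optional list of rows of booleans marking unobserved cells.
    Occupied and unknown cells are both treated as potential obstacles.
    """
    rows, cols = len(occupancy), len(occupancy[0])
    blocked = [
        [occupancy[r][c] >= free_threshold or bool(unknown and unknown[r][c])
         for c in range(cols)]
        for r in range(rows)
    ]
    cells = [(r, c) for r in range(rows) for c in range(cols)]
    field = [[0.0] * cols for _ in range(rows)]
    for r, c in cells:
        # Distance to the nearest cell of the opposite class (free vs. blocked).
        dists = [math.hypot(r - rr, c - cc)
                 for rr, cc in cells if blocked[rr][cc] != blocked[r][c]]
        d = min(dists) if dists else math.inf
        # Positive in free space, negative in occupied or unknown space.
        field[r][c] = -d if blocked[r][c] else d
    return field


grid = [[0.0, 0.0, 0.9, 0.0]]
print(signed_distance_field(grid))
print(signed_distance_field(grid, unknown=[[False, False, False, True]]))
```

Marking the last cell as unknown flips its sign to negative, which is the conservative behaviour described above: unobserved space is treated as if it might be inside an obstacle. The brute-force search is quadratic in the number of cells, which is why a real-time system needs an incremental update instead.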
We briefly touch upon the four components of the cost function (refer to Section IV-C for mathematical expressions). The objective is to minimize the total cost subject to initial boundary constraints on the UAV trajectory.
1. Smoothness: Penalizes jerky motions that may lead to camera blur and unstable flight;
2. Shot quality: Penalizes poor viewpoint angles and scales that deviate from the artistic guidelines;
3. Safety: Penalizes proximity to obstacles that are unsafe for the UAV;
4. Occlusion: Penalizes occlusion of the actor by obstacles in the environment.
[TABLE]
The solution is then tracked by the UAV.
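One way to write down an objective of this shape is as a weighted sum of the four components. The notation below is an assumption for illustration only: $\xi_q$ and $\xi_a$ stand for the UAV and actor trajectories, $\mathcal{M}$ for the signed distance map, $\xi_{q,0}$ for the initial UAV state, and $\lambda_1, \lambda_2, \lambda_3$ for hypothetical weights. The exact expressions are those of Section IV-C.

```latex
\begin{aligned}
J(\xi_q) &= J_{\text{smooth}}(\xi_q)
          + \lambda_1\, J_{\text{shot}}(\xi_q, \xi_a)
          + \lambda_2\, J_{\text{safety}}(\xi_q, \mathcal{M})
          + \lambda_3\, J_{\text{occ}}(\xi_q, \xi_a, \mathcal{M}) \\
\xi_q^{*} &= \arg\min_{\xi_q} J(\xi_q)
\quad \text{s.t.} \quad \xi_q(0) = \xi_{q,0}
\end{aligned}
```

Under this reading, smoothness depends only on the UAV trajectory, shot quality couples the UAV and actor trajectories, and the safety and occlusion terms are the ones that consume the online map, which is why all of them must be re-evaluated each time the sensed actor state or map changes.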
III Related Work
Virtual cinematography
Camera control in virtual cinematography has been extensively examined by the computer graphics community, as reviewed by [9]. These methods tend to reason about the utility of a viewpoint in isolation, following artistic principles and composition rules [10, 11] and employ either optimization-based approaches to find good viewpoints, or reactive approaches to track the virtual actor. The focus is typically on through-the-lens control where a virtual camera is manipulated while maintaining focus on certain image features [12, 13, 14, 15]. However, virtual cinematography is free of several real-world limitations such as robot physics constraints and assumes full map knowledge.
Autonomous aerial cinematography
Several contributions on aerial cinematography focus on keyframe navigation. [16, 17, 18, 19, 20] provide user interface tools for re-timing and connecting static aerial viewpoints for dynamically feasible and visually pleasing trajectories. [21] use key-frames defined on the image itself instead of world coordinates.
Other works focus on tracking dynamic targets, and employ a diverse set of techniques for actor localization and navigation. For example, [5, 22] detect the skeleton of targets from visual input, while other approaches rely on off-board actor localization methods from either motion-capture systems or GPS sensors [1, 3, 2, 4, 8]. These approaches have varying levels of complexity: [8, 4] can avoid obstacles and occlusions with the environment and with actors, while other approaches only handle collisions and occlusions caused by actors. We also observe distinct trajectory generation methods ranging from trajectory optimization to search-based planners. In Table III we summarize different contributions, also differentiating onboard versus off-board computing systems. It is important to note that, prior to our current work, none of the previous approaches provided a solution for online environment mapping.
References
- [1] N. Joubert, D. B. Goldman, F. Berthouzoz, M. Roberts, J. A. Landay, P. Hanrahan et al., “Towards a drone cinematographer: Guiding quadrotor cameras using visual composition principles,” arXiv preprint arXiv:1610.01691, 2016.
- [2] T. Nägeli, L. Meier, A. Domahidi, J. Alonso-Mora, and O. Hilliges, “Real-time planning for automated multi-view drone cinematography,” ACM Transactions on Graphics (TOG), vol. 36, no. 4, p. 132, 2017.
- [3] Q. Galvane, J. Fleureau, F. L. Tariolle, and P. Guillotel, “Automated cinematography with unmanned aerial vehicles,” in Proceedings of the Eurographics Workshop on Intelligent Cinematography and Editing, 2016.
- [4] Q. Galvane, C. Lino, M. Christie, J. Fleureau, F. Servant, F. Tariolle, P. Guillotel et al., “Directing cinematographic drones,” ACM Transactions on Graphics (TOG), vol. 37, no. 3, p. 34, 2018.
- [5] C. Huang, F. Gao, J. Pan, Z. Yang, W. Qiu, P. Chen, X. Yang, S. Shen, and K.-T. T. Cheng, “Act: An autonomous drone cinematography system for action scenes,” in 2018 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2018, pp. 7039–7046.
- [6] “DJI Mavic,” https://www.dji.com/mavic, accessed: 2019-02-28.
- [7] Skydio. (2018) Skydio R1 self-flying camera. [Online]. Available: https://www.skydio.com/technology/
- [8] R. Bonatti, Y. Zhang, S. Choudhury, W. Wang, and S. Scherer, “Autonomous drone cinematographer: Using artistic principles to create smooth, safe, occlusion-free trajectories for aerial filming,” International Symposium on Experimental Robotics, 2018.
