Temporal and Contextual Transformer for Multi-Camera Editing of TV Shows
Anyi Rao, Xuekun Jiang, Sichen Wang, Yuwei Guo, Zihao Liu, Bo Dai, Long Pang, Xiaoyu Wu, Dahua Lin, Libiao Jin

TL;DR
This paper introduces a new benchmark dataset for multi-camera TV show editing and proposes a transformer-based model that leverages historical and multi-view context to improve shot selection and view prediction.
Contribution
The paper presents a novel benchmark dataset for multi-camera editing and a transformer model that effectively utilizes temporal and contextual cues for shot transition prediction.
Findings
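The proposed method outperforms existing approaches on the proposed multi-camera editing benchmark.
The dataset covers four scenarios (concerts, sports games, gala shows, and contests), each with 6 synchronized tracks recorded by different cameras.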
The approach improves editing quality by leveraging historical and multi-view information.
Abstract
The ability to choose an appropriate camera view among multiple cameras plays a vital role in TV show delivery. However, it is hard to identify statistical patterns and apply intelligent processing because high-quality training data is lacking. To solve this issue, we first collect a novel benchmark for this setting with four diverse scenarios, including concerts, sports games, gala shows, and contests, where each scenario contains 6 synchronized tracks recorded by different cameras. It contains 88 hours of raw video that contribute to 14 hours of edited video. Based on this benchmark, we further propose a new approach, the temporal and contextual transformer, which utilizes clues from historical shots and other views to make shot transition decisions and predict which view to use. Extensive experiments show that our method outperforms existing methods on the proposed multi-camera editing…
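The view-prediction task described in the abstract can be illustrated with a toy sketch. This is not the authors' model: it assumes each shot and each candidate view is already encoded as a small feature vector, and it swaps the paper's transformer for a single scaled dot-product attention step over the shot history. All names (`attend`, `select_view`, `history`, `candidates`) and the feature values are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(query, keys):
    """Scaled dot-product attention; the keys also serve as values."""
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, k) / scale for k in keys])
    dim = len(keys[0])
    return [sum(w * k[i] for w, k in zip(weights, keys)) for i in range(dim)]


def select_view(history, candidates):
    """Pick the candidate view that best matches the temporal context.

    history:    feature vectors of previously selected shots, oldest first
    candidates: one feature vector per synchronized camera view
    Returns (index of the highest-scoring view, per-view probabilities).
    """
    # Temporal context: the most recent shot attends over the whole history.
    context = attend(history[-1], history)
    # Contextual step: score every camera view against that context.
    scale = math.sqrt(len(context))
    probs = softmax([dot(context, c) / scale for c in candidates])
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs


# Hypothetical 2-D features: three past shots and 6 synchronized views.
history = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.0]]
candidates = [
    [0.0, 1.0], [1.0, 0.0], [0.5, 0.5],
    [-1.0, 0.0], [0.0, -1.0], [0.2, 0.1],
]
best, probs = select_view(history, candidates)

# A shot transition is triggered when the best view differs from the one on air.
current_view = 0  # assumed index of the camera currently selected
should_cut = best != current_view
```

In this toy setup the history points along the first feature axis, so the second candidate (index 1) scores highest and a cut away from view 0 is suggested. A learned model would replace the hand-written features and the single attention step.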
Taxonomy
Topics: Video Analysis and Summarization · Video Coding and Compression Technologies · Cinema and Media Studies
Methods: Global-and-Local attention
