Edit As You Wish: Video Caption Editing with Multi-grained User Control
Linli Yao, Yuanmeng Zhang, Ziheng Wang, Xinglin Hou, Tiezheng Ge, Yuning Jiang, Xu Sun, Qin Jin

TL;DR
This paper introduces a new video caption editing task allowing users to revise descriptions with multi-grained control, supported by new datasets and a specialized model, to better meet diverse and dynamic user needs.
Contribution
The paper proposes the Video Caption Editing (VCE) task with a triplet command format, constructs new benchmark datasets, and develops a dedicated small-scale model for effective caption editing.
Findings
The VCE task is challenging because it requires fine-grained semantic understanding.
The proposed datasets enable comprehensive evaluation of caption editing.
The specialized model outperforms generalist models in the task.
Abstract
Automatically narrating videos in natural language complying with user requests, i.e., the Controllable Video Captioning task, can help people manage massive videos with desired intentions. However, existing works suffer from two shortcomings: 1) the control signal is single-grained, which cannot satisfy diverse user intentions; 2) the video description is generated in a single round, which cannot be further edited to meet dynamic needs. In this paper, we propose a novel Video Caption Editing (VCE) task to automatically revise an existing video description guided by multi-grained user requests. Inspired by human writing-revision habits, we design the user command as a pivotal triplet {operation, position, attribute} to cover diverse user needs from coarse-grained to fine-grained. To facilitate the VCE task, we automatically construct an…
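As a rough illustration of the {operation, position, attribute} triplet named in the abstract, the sketch below models one edit request as plain data. Only the three field names come from the paper. The class names, the `as_prompt` serialization, and every example value (including "add" and "end") are hypothetical and are not taken from the authors' datasets or code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EditCommand:
    """One user command; field names follow the triplet in the abstract."""
    operation: str  # what kind of revision to make (values are assumed)
    position: str   # where in the caption to apply it (values are assumed)
    attribute: str  # what content the revision concerns (values are assumed)

    def as_prompt(self) -> str:
        # Hypothetical flat serialization, e.g. for feeding a text model.
        return (
            f"operation: {self.operation} | "
            f"position: {self.position} | "
            f"attribute: {self.attribute}"
        )


@dataclass(frozen=True)
class EditRequest:
    """A caption to revise together with the command guiding the revision."""
    video_id: str
    caption: str
    command: EditCommand


# Illustrative values only; none of them come from the paper.
command = EditCommand(operation="add", position="end", attribute="color of the car")
request = EditRequest("video_0001", "A car drives down a road.", command)
print(request.command.as_prompt())
```

Keeping the command separate from the caption means the same caption can be paired with several commands in sequence, which matches the multi-round editing motivation in the abstract.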
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Video Analysis and Summarization
