TL;DR
This paper introduces Live Video Captioning (LVC), an online task in which captions are generated for streaming video, together with a new model, dedicated evaluation metrics, and experiments demonstrating the model's effectiveness over traditional offline methods.
Contribution
The paper formally defines the Live Video Captioning problem, proposes evaluation metrics designed for the online setting, and develops a model that combines deformable transformers with temporal filtering to caption video streams.
Findings
The proposed model outperforms state-of-the-art offline methods in live captioning tasks.
The proposed evaluation metrics capture the performance of online captioning systems better than traditional metrics.
Extensive experiments validate the effectiveness of the proposed approach.
Abstract
Dense video captioning involves detecting and describing events within video sequences. Traditional methods operate in an offline setting, assuming the entire video is available for analysis. In contrast, in this work we introduce a groundbreaking paradigm: Live Video Captioning (LVC), where captions must be generated for video streams in an online manner. This shift brings unique challenges, including processing partial observations of the events and the need for a temporal anticipation of the actions. We formally define the novel problem of LVC and propose innovative evaluation metrics specifically designed for this online scenario, demonstrating their advantages over traditional metrics. To address the novel complexities of LVC, we present a new model that combines deformable transformers with temporal filtering, enabling effective captioning over video streams. Extensive experiments…
