Unifying Event Detection and Captioning as Sequence Generation via Pre-Training

Qi Zhang; Yuqing Song; Qin Jin

arXiv:2207.08625 · cs.CV · July 24, 2023

TL;DR

This paper introduces a unified sequence generation framework for dense video captioning that combines event detection and captioning, leveraging pre-training to improve inter-task association and temporal consistency.

Contribution

It proposes a novel unified pre-training and fine-tuning approach that models event detection as sequence generation, enhancing inter-task interaction and temporal dependency modeling.

Findings

01 Outperforms state-of-the-art methods on the ActivityNet dataset

02 Benefits from large-scale video-text pre-training

03 Detects more diverse and consistent events

Abstract

Dense video captioning aims to generate corresponding text descriptions for a series of events in an untrimmed video, and can be divided into two sub-tasks: event detection and event captioning. Unlike previous works that tackle the two sub-tasks separately, recent works have focused on enhancing the inter-task association between them. However, designing inter-task interactions for event detection and captioning is not trivial due to the large differences in their task-specific solutions. Besides, previous event detection methods normally ignore temporal dependencies between events, leading to event redundancy or inconsistency problems. To tackle the above two defects, in this paper, we define event detection as a sequence generation task and propose a unified pre-training and fine-tuning framework to naturally enhance the inter-task association between event detection…
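The idea of treating event detection as sequence generation can be illustrated with a small sketch. This excerpt does not describe the paper's actual tokenization, so everything below is an assumption made for illustration: the function names `events_to_sequence` and `sequence_to_events`, the `num_bins` parameter, and the choice to quantize timestamps into uniform bins are all hypothetical. The sketch only shows how a set of event boundaries might be serialized into an ordered token sequence that a sequence model could predict one token at a time, so that each predicted event can depend on the ones before it.

```python
from typing import List, Tuple

Event = Tuple[float, float]  # (start_sec, end_sec); hypothetical representation


def events_to_sequence(events: List[Event], duration: float,
                       num_bins: int = 100) -> List[int]:
    """Serialize events as [s1, e1, s2, e2, ...], ordered by start time.

    Assumption: timestamps are quantized into `num_bins` uniform bins.
    """
    def quantize(t: float) -> int:
        # Clamp to the video extent, then map to a bin in [0, num_bins - 1].
        t = min(max(t, 0.0), duration)
        return min(int(t * num_bins / duration), num_bins - 1)

    tokens: List[int] = []
    for start, end in sorted(events):
        tokens.extend([quantize(start), quantize(end)])
    return tokens


def sequence_to_events(tokens: List[int], duration: float,
                       num_bins: int = 100) -> List[Event]:
    """Decode a token sequence back into (start, end) spans in seconds."""
    bin_width = duration / num_bins
    events: List[Event] = []
    for i in range(0, len(tokens) - 1, 2):
        start = tokens[i] * bin_width           # left edge of the start bin
        end = (tokens[i + 1] + 1) * bin_width   # right edge of the end bin
        events.append((start, end))
    return events


if __name__ == "__main__":
    video_events = [(30.0, 60.0), (0.0, 12.5)]
    seq = events_to_sequence(video_events, duration=100.0)
    print(seq)
    print(sequence_to_events(seq, duration=100.0))
```

Because the events are emitted in temporal order, a decoder that generates this sequence token by token conditions each new boundary on all earlier ones, which is one plausible way to capture the temporal dependencies the abstract mentions.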

Code & Models

Repositories

qiqang/uedvc (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Human Pose and Action Recognition