Dual-Level Decoupled Transformer for Video Captioning

Yiqi Gao; Xinglin Hou; Wei Suo; Mengyang Sun; Tiezheng Ge; Yuning Jiang; Peng Wang

arXiv:2205.03039·cs.CV·May 9, 2022

TL;DR

This paper introduces a dual-level decoupled transformer for video captioning. It decouples spatial from temporal modeling in the video representation and visual-semantic from syntax-related words in sentence generation, and it outperforms previous methods on MSVD, MSR-VTT, and VATEX.

Contribution

The proposed $\text{D}^2$ model decouples spatio-temporal representation and sentence generation, enabling end-to-end training and better utilization of pre-trained models.

Findings

01. Outperforms previous methods on MSVD, MSR-VTT, and VATEX benchmarks.

02. Effectively decouples spatial and temporal modeling for improved video understanding.

03. Introduces a syntax-aware decoder that dynamically balances semantic and syntactic word contributions.
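
The third finding, a decoder that balances semantic and syntactic word contributions, can be illustrated with a toy gate. This is only a minimal sketch under stated assumptions, not the paper's formulation: the two-branch layout, the scalar gate, and the names `gated_word_distribution`, `visual_logits`, `syntax_logits`, and `gate_logit` are all hypothetical and chosen for illustration. It shows one common way to mix two vocabulary distributions with a learned weight.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_word_distribution(visual_logits, syntax_logits, gate_logit):
    """Mix two vocabulary distributions with a scalar gate in (0, 1).

    Assumption (not from the paper): one branch scores visually grounded
    words, the other scores syntax-related words, and a scalar gate
    decides how much each branch contributes at the current step.
    """
    if len(visual_logits) != len(syntax_logits):
        raise ValueError("both branches must score the same vocabulary")
    g = sigmoid(gate_logit)
    p_visual = softmax(visual_logits)
    p_syntax = softmax(syntax_logits)
    # A convex combination of two distributions is itself a distribution.
    return [g * pv + (1.0 - g) * ps for pv, ps in zip(p_visual, p_syntax)]


# Toy vocabulary and made-up scores, purely for illustration.
vocab = ["dog", "runs", "the", "on"]
visual = [2.0, 1.0, -1.0, -1.0]   # favors content words
syntax = [-1.0, -1.0, 2.0, 1.0]   # favors function words

probs = gated_word_distribution(visual, syntax, gate_logit=3.0)
print(vocab[probs.index(max(probs))])
```

With a large positive gate the mixture follows the visual branch, and with a large negative gate it follows the syntax branch. In a trained model the gate would be predicted from the decoder state at every step rather than fixed by hand.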

Abstract

Video captioning aims to understand the spatio-temporal semantic concepts of a video and generate descriptive sentences. The de facto approach to this task has a text generator learn from offline-extracted motion or appearance features produced by pre-trained vision models. However, these methods may suffer from so-called "couple" drawbacks in both video spatio-temporal representation and sentence generation. For the former, "couple" means learning the spatio-temporal representation in a single model (3D CNN), which leads to two problems: a disconnection between the task domain and the pre-training domain, and difficulty in end-to-end training. For the latter, "couple" means treating the generation of visual-semantic words and syntax-related words equally. To this end, we present $D^{2}$ - a dual-level decoupled…

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Cancer-related molecular mechanisms research