Dual-Level Decoupled Transformer for Video Captioning
Yiqi Gao, Xinglin Hou, Wei Suo, Mengyang Sun, Tiezheng Ge, Yuning Jiang, Peng Wang

TL;DR
This paper introduces a dual-level decoupled transformer for video captioning. It decouples spatial from temporal modeling in the video representation and uses a syntax-aware decoder for sentence generation, and it reports better results than previous methods on MSVD, MSR-VTT, and VATEX.
Contribution
The proposed $\text{D}^2$ model applies decoupling at two levels, spatio-temporal representation and sentence generation, enabling end-to-end training and better use of pre-trained models.
Findings
Outperforms previous methods on MSVD, MSR-VTT, and VATEX benchmarks.
Effectively decouples spatial and temporal modeling for improved video understanding.
Introduces a syntax-aware decoder that dynamically balances semantic and syntactic word contributions.
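The idea of dynamically balancing semantic and syntactic words can be illustrated with a small gating sketch. This is not the paper's decoder: the function `gated_word_distribution`, the scalar `gate_logit`, the toy vocabulary, and all logit values are assumptions made up for illustration. It only shows one common way a per-step gate could weight a visual-semantic word distribution against a syntax-oriented one.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_word_distribution(visual_logits, syntax_logits, gate_logit):
    """Mix two vocabulary distributions with a scalar gate in (0, 1).

    Hypothetical helper: a gate near 1 favours the visual-semantic branch,
    a gate near 0 favours the syntax branch.
    """
    if len(visual_logits) != len(syntax_logits):
        raise ValueError("both branches must score the same vocabulary")
    g = sigmoid(gate_logit)
    p_vis = softmax(visual_logits)
    p_syn = softmax(syntax_logits)
    return [g * pv + (1.0 - g) * ps for pv, ps in zip(p_vis, p_syn)]


# Toy vocabulary and made-up scores (assumptions, not data from the paper).
vocab = ["a", "dog", "is", "running"]
visual_logits = [0.0, 3.0, 0.0, 2.5]   # favours content words
syntax_logits = [2.0, 0.0, 2.5, 0.0]   # favours function words

# A step where the gate leans on visual evidence, and one where it leans on syntax.
content_step = gated_word_distribution(visual_logits, syntax_logits, gate_logit=4.0)
function_step = gated_word_distribution(visual_logits, syntax_logits, gate_logit=-4.0)

best_content = vocab[content_step.index(max(content_step))]
best_function = vocab[function_step.index(max(function_step))]
```

In a trained model the gate would be predicted from the decoder state at each step rather than set by hand; the fixed values here only make the two regimes easy to inspect.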
Abstract
Video captioning aims to understand the spatio-temporal semantic concepts of a video and generate descriptive sentences. The de facto approach to this task has a text generator learn from offline-extracted motion or appearance features produced by pre-trained vision models. However, these methods may suffer from so-called "couple" drawbacks in both video spatio-temporal representation and sentence generation. For the former, "couple" means learning the spatio-temporal representation in a single model (3D CNN), which leads to two problems: a disconnection between the task and pre-training domains, and difficulty with end-to-end training. For the latter, "couple" means treating the generation of visual-semantic words and syntax-related words equally. To this end, we present - a dual-level decoupled…
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Cancer-related molecular mechanisms research
