TL;DR
This paper introduces a multi-sentence video description task that keeps person identities consistent across clips. The approach uses a Transformer and gender-aware representations to improve re-identification and description coherence.
Contribution
It proposes the Fill-in the Identity auxiliary task and a two-stage approach to identity-aware video description, so that the same person is linked consistently across video segments.
Findings
The Fill-in the Identity model outperforms baselines and recent methods.
The approach enables coherent multi-sentence descriptions with re-identified persons.
An augmented version of the LSMDC benchmark supports the new task and its evaluation.
Abstract
Standard video and movie description tasks abstract away from person identities, thus failing to link identities across sentences. We propose a multi-sentence Identity-Aware Video Description task, which overcomes this limitation and requires re-identifying persons locally within a set of consecutive clips. We introduce an auxiliary task of Fill-in the Identity, which aims to predict persons' IDs consistently within a set of clips when the video descriptions are given. Our proposed approach to this task leverages a Transformer architecture, allowing for coherent joint prediction of multiple IDs. One of the key components is a gender-aware textual representation, as well as an additional gender prediction objective in the main model. This auxiliary task allows us to propose a two-stage approach to Identity-Aware Video Description. We first generate multi-sentence video descriptions, and then…
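To make the Fill-in the Identity setup concrete, here is a minimal Python sketch of the input/output format only. It is not the authors' model or code. The blank token `[...]`, the `[PERSONn]` label format, and the helper names `fill_identities` and `canonicalize` are all hypothetical choices made for illustration. It also assumes that local IDs are meaningful only up to renaming within one set of clips, so it relabels them by order of first appearance.

```python
from typing import Dict, List

BLANK = "[...]"  # hypothetical placeholder for a person mention


def canonicalize(ids: List[int]) -> List[int]:
    """Relabel local IDs by order of first appearance (1, 2, ...).

    Assumption: IDs are local to a set of clips, so only the grouping
    of blanks matters, not the particular numbers used.
    """
    mapping: Dict[int, int] = {}
    return [mapping.setdefault(i, len(mapping) + 1) for i in ids]


def fill_identities(sentences: List[str], ids: List[int]) -> List[str]:
    """Insert one local person label per blank, in reading order."""
    n_blanks = sum(s.count(BLANK) for s in sentences)
    if n_blanks != len(ids):
        raise ValueError(f"expected {n_blanks} IDs, got {len(ids)}")
    labels = iter(canonicalize(ids))
    filled = []
    for sentence in sentences:
        parts = sentence.split(BLANK)
        out = parts[0]
        for tail in parts[1:]:
            out += f"[PERSON{next(labels)}]" + tail
        filled.append(out)
    return filled


# Descriptions for two consecutive clips, with person mentions blanked out.
clips = ["[...] walks in.", "[...] greets [...]."]
# A hypothetical joint prediction: blanks 1 and 3 refer to the same person.
print(fill_identities(clips, [7, 4, 7]))
```

In the two-stage approach described above, the blanked descriptions would come from a description generator rather than being given; the sketch covers only the labeling step.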
Taxonomy
Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Adam · Byte Pair Encoding · Dropout · Label Smoothing · Multi-Head Attention · Residual Connection · Softmax
