Identity-Aware Multi-Sentence Video Description

Jae Sung Park; Trevor Darrell; Anna Rohrbach

arXiv:2008.09791·cs.CV·August 25, 2020

TL;DR

This paper introduces a multi-sentence video description task that keeps person identities consistent across consecutive clips, and addresses it with a Transformer-based approach that uses gender-aware textual representations to improve re-identification and description coherence.

Contribution

It proposes the auxiliary Fill-in the Identity task and a two-stage approach to identity-aware video description, so that persons are linked consistently across video segments.

Findings

01 The Fill-in the Identity model outperforms baselines and recent methods.

02 The approach enables coherent multi-sentence descriptions with re-identified persons.

03 Augmented LSMDC benchmark supports the new task and evaluation.

Abstract

Standard video and movie description tasks abstract away from person identities, thus failing to link identities across sentences. We propose a multi-sentence Identity-Aware Video Description task, which overcomes this limitation and requires re-identifying persons locally within a set of consecutive clips. We introduce an auxiliary task of Fill-in the Identity, which aims to predict persons' IDs consistently within a set of clips when the video descriptions are given. Our proposed approach to this task leverages a Transformer architecture allowing for coherent joint prediction of multiple IDs. One of the key components is a gender-aware textual representation as well as an additional gender prediction objective in the main model. This auxiliary task allows us to propose a two-stage approach to Identity-Aware Video Description. We first generate multi-sentence video descriptions, and then…
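
To make "consistent" ID prediction concrete, the sketch below shows one possible reading of the Fill-in the Identity setup: because IDs are local to a set of clips, only the grouping of blanks matters, not the label names. The function names `canonicalize` and `pairwise_agreement`, the label strings, and the pairwise score are illustrative assumptions, not the paper's model or its official evaluation metric.

```python
from itertools import combinations


def canonicalize(ids):
    """Relabel IDs by order of first appearance (1, 2, ...).

    Hypothetical helper: local IDs carry no meaning beyond which
    blanks share a person, so label names are normalized away.
    """
    mapping = {}
    relabeled = []
    for person_id in ids:
        if person_id not in mapping:
            mapping[person_id] = len(mapping) + 1
        relabeled.append(mapping[person_id])
    return relabeled


def pairwise_agreement(pred, gold):
    """Fraction of blank pairs whose same/different relation matches.

    Assumed illustrative score, not the paper's metric.
    """
    if len(pred) != len(gold):
        raise ValueError("pred and gold must cover the same blanks")
    pairs = list(combinations(range(len(gold)), 2))
    if not pairs:
        return 1.0
    hits = sum(
        (pred[a] == pred[b]) == (gold[a] == gold[b]) for a, b in pairs
    )
    return hits / len(pairs)


# Four blanks across a set of clips; labels are arbitrary placeholders.
gold = ["P1", "P2", "P1", "P2"]
pred = ["B", "A", "B", "B"]

print(canonicalize(gold))
print(canonicalize(pred))
print(pairwise_agreement(pred, gold))
```

A prediction that uses different label names but the same grouping as the reference scores 1.0 under this assumed measure, which is the sense in which IDs are treated as local rather than global.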

Code & Models

Repositories

jamespark3922/lsmdc-fillin
PyTorch · Official

Taxonomy

Methods

Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Adam · Byte Pair Encoding · Dropout · Label Smoothing · Multi-Head Attention · Residual Connection · Softmax