MICap: A Unified Model for Identity-aware Movie Descriptions

Haran Raajesh; Naveen Reddy Desanur; Zeeshan Khan; Makarand Tapaswi

arXiv:2405.11483·cs.CV·May 21, 2024

TL;DR

This paper introduces MICap, a unified model for identity-aware movie captioning that can either generate captions containing character identities or fill in the identity blanks of a given caption, improving FITB accuracy by 4.2% and captioning metrics by 1-2%.

Contribution

MICap is a single-stage model that switches between id-aware caption generation and the fill-in-the-blanks (FITB) task. The paper also introduces iSPICE, a new evaluation metric for identity accuracy.

Findings

01. 4.2% improvement in FITB accuracy

02. 1-2% improvement in captioning metrics

03. Effective unified approach for identity-aware captioning

Abstract

Characters are an important aspect of any storyline, and identifying and including them in descriptions is necessary for story understanding. While previous work has largely ignored identity and generated captions with "someone" (anonymized names), recent work formulates id-aware captioning as a fill-in-the-blanks (FITB) task, where, given a caption with blanks, the goal is to predict person id labels. However, to predict captions with ids, a two-stage approach is required: first predict captions with "someone", then fill in identities. In this work, we present a new single-stage approach that can seamlessly switch between id-aware caption generation and FITB when given a caption with blanks. Our model, Movie-Identity Captioner (MICap), uses a shared auto-regressive decoder that benefits from training with FITB and full-caption generation objectives, while the encoder can benefit from or…
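
To make the relationship between the two tasks in the abstract concrete, here is a minimal sketch of the data format only, not of the MICap model. It assumes blanks are written as the word "someone" and that person id labels look like `P1`, `P2`; both conventions, and the helper names `fill_blanks` and `anonymize`, are hypothetical and chosen purely for illustration.

```python
import re
from typing import Sequence

# Assumption: the anonymized placeholder is the literal word "someone".
BLANK = "someone"
_BLANK_RE = re.compile(rf"\b{BLANK}\b")


def fill_blanks(caption: str, ids: Sequence[str]) -> str:
    """FITB output format: replace each blank, in order, with an id label."""
    n_blanks = len(_BLANK_RE.findall(caption))
    if n_blanks != len(ids):
        raise ValueError(
            f"caption has {n_blanks} blanks but {len(ids)} ids were given"
        )
    labels = iter(ids)
    # The replacement function consumes one label per matched blank.
    return _BLANK_RE.sub(lambda _match: next(labels), caption)


def anonymize(caption: str, id_pattern: str = r"\bP\d+\b") -> str:
    """Inverse direction: turn an id-aware caption into a 'someone' caption.

    Assumption: id labels match the hypothetical pattern P<number>.
    """
    return re.sub(id_pattern, BLANK, caption)


if __name__ == "__main__":
    blanked = "someone hands someone a letter."
    id_aware = fill_blanks(blanked, ["P1", "P2"])
    print(id_aware)
    print(anonymize(id_aware))
```

In the two-stage setup described above, a captioner would produce the blanked caption and a separate model would supply the id list passed to `fill_blanks`. A single-stage model instead emits the id-aware caption directly, and can still be used in FITB mode when a blanked caption is provided.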

Taxonomy

Topics: Video Analysis and Summarization · Natural Language Processing Techniques · Music and Audio Processing