Caption Generation on Scenes with Seen and Unseen Object Categories

Berkan Demirel; Ramazan Gokberk Cinbis

arXiv:2108.06165·cs.CV·July 4, 2022

TL;DR

This paper introduces a zero-shot scene captioning framework that uses a generalized zero-shot detection model to recognize and localize both seen and unseen objects, then turns the detections into captions via templates. It targets scenes containing object categories that have no training examples.

Contribution

The paper presents a detection-driven zero-shot captioning method with class representations based on class-to-class similarities, and a new evaluation metric for visual and non-visual content assessment.

Findings

01 Effective recognition of unseen objects in scenes.

02 Improved caption quality with the proposed approach.

03 New insights into zero-shot captioning evaluation.

Abstract

Image caption generation is one of the most challenging problems at the intersection of vision and language domains. In this work, we propose a realistic captioning task where the input scenes may incorporate visual objects with no corresponding visual or textual training examples. For this problem, we propose a detection-driven approach that consists of a single-stage generalized zero-shot detection model to recognize and localize instances of both seen and unseen classes, and a template-based captioning model that transforms detections into sentences. To improve the generalized zero-shot detection model, which provides essential information for captioning, we define effective class representations in terms of class-to-class semantic similarities, and leverage their special structure to construct an effective unseen/seen class confidence score calibration mechanism. We also propose a…
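
To make two ideas from the abstract concrete, here is a minimal sketch, not the authors' implementation. It assumes class representations are built as cosine similarities between a class embedding and the embeddings of the seen classes, and that a caption is produced by filling a fixed sentence template with detected labels. The toy 3-dimensional embeddings, the seen/unseen split, and the names `similarity_representation` and `caption_from_detections` are all hypothetical and chosen only for illustration.

```python
import math

# Toy, made-up embeddings (assumption: a real system would use pretrained word vectors).
EMBEDDINGS = {
    "dog": [1.0, 0.0, 0.0],   # seen class
    "car": [0.0, 1.0, 0.0],   # seen class
    "wolf": [0.9, 0.1, 0.0],  # unseen class
}
SEEN_CLASSES = ["dog", "car"]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity_representation(class_name):
    """Describe a class by its similarity to each seen class (hypothetical helper)."""
    vec = EMBEDDINGS[class_name]
    return [cosine(vec, EMBEDDINGS[seen]) for seen in SEEN_CLASSES]


def caption_from_detections(labels):
    """Fill a fixed template with detected labels (illustrative template only)."""
    if not labels:
        return "A scene."
    phrases = [("an " if label[0] in "aeiou" else "a ") + label for label in labels]
    if len(phrases) == 1:
        body = phrases[0]
    else:
        body = ", ".join(phrases[:-1]) + " and " + phrases[-1]
    return "A scene with " + body + "."


wolf_repr = similarity_representation("wolf")
caption = caption_from_detections(["wolf", "car"])
```

In this toy setup the unseen class "wolf" ends up closer to "dog" than to "car", which is the kind of structure a similarity-based representation is meant to expose; how the paper calibrates seen versus unseen confidence scores is not covered by this sketch.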
