Beyond a Pre-Trained Object Detector: Cross-Modal Textual and Visual Context for Image Captioning

Chia-Wen Kuo; Zsolt Kira

arXiv:2205.04363·cs.CV·June 9, 2022

TL;DR

This paper improves image captioning by conditioning the captioning model on cross-modal textual and visual context: attributes and relationships retrieved with a multi-modal pre-trained model (CLIP), which improves grounding and caption quality.

Contribution

It conditions the captioning model on auxiliary attributes and relationships mined from the Visual Genome dataset and retrieved with a multi-modal pre-trained model (CLIP), rather than on the outputs of a fixed object detector alone.

Findings

01. Achieved +7.5% CIDEr score improvement.

02. Demonstrated better grounding of objects in captions.

03. Validated the importance of multi-modal pre-trained models.

Abstract

Significant progress has been made on visual captioning, largely relying on pre-trained features and later fixed object detectors that serve as rich inputs to auto-regressive models. A key limitation of such methods, however, is that the output of the model is conditioned only on the object detector's outputs. The assumption that such outputs can represent all necessary information is unrealistic, especially when the detector is transferred across datasets. In this work, we reason about the graphical model induced by this assumption, and propose to add an auxiliary input to represent missing information such as object relationships. We specifically propose to mine attributes and relationships from the Visual Genome dataset and condition the captioning model on them. Crucially, we propose (and show to be important) the use of a multi-modal pre-trained model (CLIP) to retrieve such…
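The retrieval step mentioned in the abstract can be illustrated with a minimal sketch. It assumes that image and text descriptions are already embedded in a shared space (as a model such as CLIP provides) and that retrieval ranks texts by cosine similarity to the image. The toy 3-dimensional embeddings, the example descriptions, and the function names `l2_normalize` and `retrieve_top_k` are hypothetical and are not taken from the paper or its repository.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (an all-zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def retrieve_top_k(image_emb, text_embs, texts, k=2):
    """Return the k texts whose embeddings are most cosine-similar to the image."""
    query = l2_normalize(image_emb)
    scored = []
    for text, emb in zip(texts, text_embs):
        candidate = l2_normalize(emb)
        similarity = sum(a * b for a, b in zip(query, candidate))
        scored.append((similarity, text))
    # Highest similarity first
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Hand-written toy embeddings (assumption: stand-ins for real encoder outputs)
texts = ["dog chasing frisbee", "green grass", "red car parked"]
text_embs = [[0.9, 0.1, 0.0], [0.6, 0.8, 0.0], [0.0, 0.1, 0.9]]
image_emb = [1.0, 0.3, 0.0]

# Retrieved descriptions would serve as extra textual context for the captioner
context = retrieve_top_k(image_emb, text_embs, texts, k=2)
```

In this sketch the retrieved descriptions are simply returned as a list; how they are encoded and fed to the captioning model is outside its scope.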

Code & Models

Repositories

GT-RIPL/Xmodal-Ctx
PyTorch · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning