TROPE: TRaining-Free Object-Part Enhancement for Seamlessly Improving Fine-Grained Zero-Shot Image Captioning

Joshua Feinglass; Yezhou Yang

arXiv:2409.19960·cs.CV·November 5, 2024

TL;DR

TROPE is a training-free method that enhances zero-shot image captioning by adding object-part details, significantly improving performance on fine-grained datasets without retraining models.

Contribution

Introduces TROPE, a training-free approach that enriches a base caption with object-part details using object detector proposals and NLP techniques, improving zero-shot image captioning on fine-grained datasets.

Findings

01 Consistently improves zero-shot captioning performance.

02 Achieves state-of-the-art results on fine-grained datasets.

03 Seamlessly integrates with existing captioning methods.

Abstract

Zero-shot inference, where pre-trained models perform tasks without specific training data, is an exciting emergent ability of large models like CLIP. Although there has been considerable exploration into enhancing zero-shot abilities in image captioning (IC) for popular datasets such as MSCOCO and Flickr8k, these approaches fall short with fine-grained datasets like CUB, FLO, UCM-Captions, and Sydney-Captions. These datasets require captions to discern between visually and semantically similar classes, focusing on detailed object parts and their attributes. To overcome this challenge, we introduce TRaining-Free Object-Part Enhancement (TROPE). TROPE enriches a base caption with additional object-part details using object detector proposals and Natural Language Processing techniques. It complements rather than alters the base caption, allowing seamless integration with other captioning…
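The abstract describes TROPE as appending object-part details, drawn from object detector proposals, to a base caption without altering it. The sketch below illustrates only that general idea and is not the authors' implementation: the `Proposal` type, the `enhance_caption` function, the score threshold, and the "with ..." phrasing are all hypothetical choices made for illustration, and the proposals are hand-written stand-ins for real detector output.

```python
from dataclasses import dataclass


@dataclass
class Proposal:
    """Hypothetical detector output: an object part, one attribute, a confidence."""
    part: str        # e.g. "beak"
    attribute: str   # e.g. "yellow"
    score: float     # detector confidence in [0, 1]


def enhance_caption(base_caption, proposals, min_score=0.5, max_parts=3):
    """Append part details to the base caption; the base text itself is left unchanged."""
    # Words already present in the base caption are not repeated.
    cleaned = base_caption.lower().replace(".", " ").replace(",", " ")
    base_words = set(cleaned.split())

    # Keep the highest-scoring proposal per part, above the (assumed) threshold.
    best = {}
    for p in proposals:
        if p.score < min_score or p.part.lower() in base_words:
            continue
        if p.part not in best or p.score > best[p.part].score:
            best[p.part] = p

    chosen = sorted(best.values(), key=lambda p: p.score, reverse=True)[:max_parts]
    if not chosen:
        return base_caption

    phrases = [f"{p.attribute} {p.part}" for p in chosen]
    if len(phrases) == 1:
        detail = phrases[0]
    else:
        detail = ", ".join(phrases[:-1]) + " and " + phrases[-1]

    # The detail is added as a suffix so the base caption is complemented, not rewritten.
    return f"{base_caption.rstrip('. ')} with {detail}."


if __name__ == "__main__":
    base = "A small bird perched on a branch."
    proposals = [
        Proposal("beak", "yellow", 0.9),
        Proposal("wing", "black", 0.8),
        Proposal("branch", "brown", 0.95),  # already in the base caption, skipped
        Proposal("tail", "long", 0.3),      # below the threshold, skipped
    ]
    print(enhance_caption(base, proposals))
```

Because the enhancement is a pure suffix on whatever caption it receives, a function of this shape could in principle be placed after any base captioner, which is the integration property the abstract emphasizes.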

Code & Models

Repositories

JoshuaFeinglass/TROPE
Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Advanced Vision and Imaging

Methods: Balanced Selection · Contrastive Language-Image Pre-training