TROPE: TRaining-Free Object-Part Enhancement for Seamlessly Improving Fine-Grained Zero-Shot Image Captioning
Joshua Feinglass, Yezhou Yang

TL;DR
TROPE is a training-free method that enhances zero-shot image captioning by adding object-part details, significantly improving performance on fine-grained datasets without retraining models.
Contribution
Introduces TROPE, a training-free approach that enriches a base caption with object-part details using object detector proposals and NLP techniques, improving zero-shot image captioning on fine-grained datasets.
Findings
Consistently improves zero-shot captioning performance.
Achieves state-of-the-art results on fine-grained datasets.
Seamlessly integrates with existing captioning methods.
Abstract
Zero-shot inference, where pre-trained models perform tasks without specific training data, is an exciting emergent ability of large models like CLIP. Although there has been considerable exploration into enhancing zero-shot abilities in image captioning (IC) for popular datasets such as MSCOCO and Flickr8k, these approaches fall short with fine-grained datasets like CUB, FLO, UCM-Captions, and Sydney-Captions. These datasets require captions to discern between visually and semantically similar classes, focusing on detailed object parts and their attributes. To overcome this challenge, we introduce TRaining-Free Object-Part Enhancement (TROPE). TROPE enriches a base caption with additional object-part details using object detector proposals and Natural Language Processing techniques. It complements rather than alters the base caption, allowing seamless integration with other captioning…
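The enrichment idea in the abstract (keep the base caption intact and append object-part details drawn from detector proposals) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `enrich_caption`, the `(phrase, score)` proposal format, the `max_parts` and `min_score` parameters, and the "last word names the part" heuristic are all assumptions made for illustration.

```python
from typing import List, Tuple


def enrich_caption(base_caption: str,
                   proposals: List[Tuple[str, float]],
                   max_parts: int = 2,
                   min_score: float = 0.5) -> str:
    """Append object-part phrases to a base caption without altering it.

    `proposals` is a hypothetical list of (phrase, confidence) pairs,
    e.g. ("a red beak", 0.9), standing in for object detector output.
    """
    base = base_caption.rstrip(" .")
    base_words = set(base.lower().split())
    chosen: List[str] = []

    # Visit the highest-confidence proposals first.
    for phrase, score in sorted(proposals, key=lambda p: p[1], reverse=True):
        if len(chosen) >= max_parts or score < min_score:
            break
        # Assumption: the last word of the phrase names the object part.
        part = phrase.lower().split()[-1]
        # Skip parts the base caption already mentions, and duplicates.
        if part in base_words or phrase in chosen:
            continue
        chosen.append(phrase)

    if not chosen:
        return base + "."
    return f"{base} with {' and '.join(chosen)}."


caption = enrich_caption(
    "A small bird on a branch.",
    [("a red beak", 0.9), ("a white belly", 0.8),
     ("a brown branch", 0.95), ("a long tail", 0.3)],
)
print(caption)
```

Because the base caption is only extended, never rewritten, a step like this could sit behind any captioning model, which matches the "complements rather than alters" property the abstract describes.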
Taxonomy
Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Advanced Vision and Imaging
Methods: Balanced Selection · Contrastive Language-Image Pre-training
