Word to Sentence Visual Semantic Similarity for Caption Generation: Lessons Learned

Ahmed Sabir

arXiv:2209.12817 · cs.CL · July 10, 2023

TL;DR

This paper introduces a visual semantic similarity approach that improves image captioning by selecting the candidate caption most closely related to the image, rather than the most probable output of the model.

Contribution

The paper presents a post-processing method, applicable to any captioning system, that uses visual semantic measures to select the caption that best matches the image.

Findings

01 Improved caption relevance through semantic matching

02 Applicable as a post-processing step for existing systems

03 Enhances image-caption alignment accuracy

Abstract

This paper focuses on enhancing the captions generated by image-caption generation systems. We propose an approach for improving caption generation systems by choosing the output most closely related to the image rather than the most likely output produced by the model. Our model revises the beam search output of the language generator from a visual context perspective. We employ a visual semantic measure at both the word and sentence level to match the proper caption to the related information in the image. The proposed approach can be applied to any caption system as a post-processing method.
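
The re-ranking idea in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, the `weight` mixing factor, and the bag-of-words cosine similarity are all assumptions chosen to keep the example self-contained. The paper's actual visual semantic measure is not specified on this page, and a real system would likely use learned word and sentence embeddings instead of token overlap.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lowercased token counts; a toy stand-in for a learned text embedding."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rerank(candidates, visual_context, weight=0.5):
    """Re-order beam candidates by model probability mixed with visual similarity.

    candidates: list of (caption, model_probability) pairs from beam search.
    visual_context: words describing the image, e.g. detected object labels.
    weight: hypothetical mixing factor between the two scores (an assumption).
    """
    context_vec = bag_of_words(" ".join(visual_context))
    scored = []
    for caption, prob in candidates:
        sim = cosine(bag_of_words(caption), context_vec)
        scored.append(((1 - weight) * prob + weight * sim, caption))
    # Highest combined score first; sort on the score only.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [caption for _, caption in scored]


# Hypothetical beam output: the most probable caption is not the best match.
beam = [
    ("a man riding a wave", 0.40),
    ("a man on a surfboard in the ocean", 0.35),
]
best = rerank(beam, ["surfboard", "ocean"])[0]
print(best)
```

In this toy example the second caption is ranked first because it shares words with the visual context, even though the model assigned it a lower probability. With an empty visual context the similarity term is zero for every candidate, so the original probability ordering is preserved.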

Taxonomy

Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Advanced Image and Video Retrieval Techniques