Word to Sentence Visual Semantic Similarity for Caption Generation: Lessons Learned
Ahmed Sabir

TL;DR
This paper introduces a visual semantic similarity approach to improve image captioning by selecting captions most related to the image, rather than the most probable output, enhancing caption relevance.
Contribution
It presents a post-processing method that uses visual semantic measures to select more accurate captions for images, applicable to any captioning system.
Findings
Improved caption relevance through semantic matching
Applicable as a post-processing step for existing systems
Enhances image-caption alignment accuracy
Abstract
This paper focuses on improving the captions produced by image caption generation systems. We propose selecting the candidate caption most closely related to the image rather than the most likely output produced by the model. Our model re-ranks the beam search output of the language generation step from a visual context perspective. We employ a visual semantic measure at both the word level and the sentence level to match the proper caption to the related information in the image. The proposed approach can be applied to any captioning system as a post-processing method.
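The re-ranking idea in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: simple token overlap and bag-of-words cosine similarity stand in for the visual semantic measure, and the names `word_level_score`, `sentence_level_score`, `rerank_captions`, `visual_labels`, and the weight `alpha` are hypothetical. It assumes the visual context is available as a list of object labels detected in the image and that the captioner returns beam search candidates with log-probabilities.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def word_level_score(caption: str, visual_labels: list[str]) -> float:
    """Fraction of visual labels mentioned in the caption (toy stand-in)."""
    if not visual_labels:
        return 0.0
    tokens = set(caption.lower().split())
    return sum(1 for label in visual_labels if label in tokens) / len(visual_labels)


def sentence_level_score(caption: str, visual_labels: list[str]) -> float:
    """Similarity between the whole caption and the visual context (toy stand-in)."""
    return cosine(Counter(caption.lower().split()), Counter(visual_labels))


def rerank_captions(candidates, visual_labels, alpha=0.5):
    """Sort beam candidates by visual relatedness, not by model likelihood.

    candidates: list of (caption, log_prob) pairs from beam search.
    alpha: hypothetical weight mixing the word-level and sentence-level scores.
    Returns a list of (caption, visual_score), best first.
    """
    scored = []
    for caption, _log_prob in candidates:
        score = (alpha * word_level_score(caption, visual_labels)
                 + (1 - alpha) * sentence_level_score(caption, visual_labels))
        scored.append((caption, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Illustrative data only: two beam candidates and labels from an object detector.
candidates = [
    ("a man standing in a field", -1.0),
    ("a dog catching a frisbee on grass", -1.5),
]
visual_labels = ["dog", "frisbee", "grass"]

most_likely = max(candidates, key=lambda pair: pair[1])[0]
reranked = rerank_captions(candidates, visual_labels)
```

Here `most_likely` is the caption the model itself prefers, while `reranked[0]` is the caption that best matches the visual context. Because the step only consumes the candidate list, it can sit after any captioning system. In practice the toy scores would be replaced by a learned semantic similarity model.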
Taxonomy
Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Advanced Image and Video Retrieval Techniques
