Comprehending and Ordering Semantics for Image Captioning
Yehao Li, Yingwei Pan, Ting Yao, Tao Mei

TL;DR
This paper introduces COS-Net, a Transformer-based model that comprehends and orders image semantics to generate more coherent and accurate image captions, outperforming existing methods on standard benchmarks.
Contribution
The paper proposes a unified architecture that combines semantic comprehension and semantic ordering for image captioning, built on cross-modal retrieval, semantic filtering, and semantic ranking.
Findings
COS-Net surpasses state-of-the-art approaches on the COCO dataset.
It achieves a CIDEr score of 141.1% on the Karpathy split, the highest reported at the time.
It demonstrates effective semantic comprehension and ordering in caption generation.
Abstract
Comprehending the rich semantics in an image and ordering them in linguistic order are essential to compose a visually-grounded and linguistically coherent description for image captioning. Modern techniques commonly capitalize on a pre-trained object detector/classifier to mine the semantics in an image, while leaving the inherent linguistic ordering of semantics under-exploited. In this paper, we propose a new recipe of Transformer-style structure, namely Comprehending and Ordering Semantics Networks (COS-Net), that novelly unifies an enriched semantic comprehending and a learnable semantic ordering processes into a single architecture. Technically, we initially utilize a cross-modal retrieval model to search the relevant sentences of each image, and all words in the searched sentences are taken as primary semantic cues. Next, a novel semantic comprehender is devised to filter out the…
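The first two steps named in the abstract (retrieve relevant sentences for an image, then take their words as primary semantic cues) can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the toy embeddings, the cosine-similarity ranking, the `top_k` value, and the function names are hypothetical and are not taken from the paper. The semantic comprehender and the ordering step are not sketched, since the abstract above is cut off before describing them.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve_sentences(image_vec, sentence_pool, top_k=2):
    """Return the top_k sentences whose embeddings are closest to the image.

    sentence_pool is a list of (text, embedding) pairs. A real system would
    get both embeddings from a trained cross-modal retrieval model; here
    they are hand-written toy vectors (assumption).
    """
    ranked = sorted(
        sentence_pool,
        key=lambda item: cosine(image_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:top_k]]


def primary_semantic_cues(sentences):
    """Collect every word of the retrieved sentences as a semantic cue.

    Words are lowercased and de-duplicated in order of first appearance;
    the de-duplication is a simplification chosen for this sketch.
    """
    cues = []
    for sentence in sentences:
        for word in sentence.lower().split():
            if word not in cues:
                cues.append(word)
    return cues


# Toy data (hypothetical): three candidate sentences and one image embedding.
pool = [
    ("a dog runs on the beach", [0.9, 0.1, 0.0]),
    ("a man rides a horse", [0.1, 0.9, 0.1]),
    ("a dog plays with a ball", [0.8, 0.2, 0.1]),
]
image_vec = [1.0, 0.1, 0.0]

retrieved = retrieve_sentences(image_vec, pool, top_k=2)
cues = primary_semantic_cues(retrieved)
print(retrieved)
print(cues)
```

In this toy setup the cue list still contains function words such as "a" and "the", which gives a rough sense of why a later filtering stage over the primary cues is useful.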
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Natural Language Processing Techniques
