Progressive Tree-Structured Prototype Network for End-to-End Image Captioning
Pengpeng Zeng, Jinkuan Zhu, Jingkuan Song, Lianli Gao

TL;DR
This paper introduces a novel Progressive Tree-Structured Prototype Network (PTSN) for end-to-end image captioning, leveraging hierarchical textual semantics to improve prediction accuracy and achieve state-of-the-art results on MSCOCO.
Contribution
The paper proposes what its authors describe as the first attempt to model hierarchical textual semantics for image captioning, using tree-structured prototypes and a progressive aggregation module.
Findings
Achieves new state-of-the-art CIDEr scores on the MSCOCO dataset.
Demonstrates effective modeling of hierarchical textual semantics.
Improves captioning performance with hierarchical prototypes.
Abstract
Studies of image captioning are shifting towards a fully end-to-end paradigm, leveraging powerful visual pre-trained models and transformer-based generation architectures for more flexible model training and faster inference. State-of-the-art approaches simply extract isolated concepts or attributes to assist description generation. However, such approaches do not consider the hierarchical semantic structure in the textual domain, which leads to an unpredictable mapping between visual representations and concept words. To this end, we propose a novel Progressive Tree-Structured Prototype Network (dubbed PTSN), which is the first attempt to narrow down the scope of prediction words with appropriate semantics by modeling the hierarchical textual semantics. Specifically, we design a novel embedding method called tree-structured prototype, producing a set of hierarchical…
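The abstract is cut off above and does not say how the tree-structured prototypes are built. Purely as an illustration, the sketch below assumes they come from recursively clustering word embeddings with k-means, so that each level of the tree holds finer-grained semantic centers than the level above it. The function names, the branching factors, and the toy 2-D vectors are all hypothetical and are not taken from the paper.

```python
import random


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means on lists of floats; returns (centroids, assignments)."""
    rng = random.Random(seed)
    # Initialise centroids from k distinct input points.
    centroids = [list(p) for p in rng.sample(points, k)]
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        for i, p in enumerate(points):
            assign[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assign


def build_prototype_tree(word_vectors, branching=(2, 2)):
    """Assumed construction: cluster, then re-cluster each cluster.

    Returns a list of levels ordered coarse to fine; each level is a
    list of prototype vectors (cluster centroids).
    """
    levels = []
    groups = [word_vectors]
    for k in branching:
        level, next_groups = [], []
        for group in groups:
            kk = min(k, len(group))  # cannot ask for more clusters than points
            centroids, assign = kmeans(group, kk)
            level.extend(centroids)
            for c in range(kk):
                members = [p for p, a in zip(group, assign) if a == c]
                if members:
                    next_groups.append(members)
        levels.append(level)
        groups = next_groups
    return levels


# Toy stand-ins for word embeddings (hypothetical data).
toy_vectors = [
    [0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0],
    [0.0, 9.0], [0.1, 9.0], [9.0, 0.0], [9.1, 0.0],
]
for depth, level in enumerate(build_prototype_tree(toy_vectors)):
    print(f"level {depth}: {len(level)} prototypes")
```

In a captioning model, coarse prototypes of this kind could be consulted before fine ones to narrow the candidate vocabulary step by step, which is the intuition the abstract gives for modeling hierarchical textual semantics. How PTSN actually does this is not recoverable from the truncated text.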
Taxonomy
Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Advanced Image and Video Retrieval Techniques
