Cross Modification Attention Based Deliberation Model for Image Captioning
Zheng Lian, Yanan Zhang, Haichang Li, Rui Wang, Xiaohui Hu

TL;DR
This paper introduces a two-pass image captioning framework built around a novel Cross Modification Attention (CMA) module. A second decoding pass refines the draft caption and draws on future context, and the resulting models outperform their single-pass counterparts.
Contribution
It proposes a universal two-pass decoding framework with CMA that strengthens semantic understanding and corrects errors in draft captions, in contrast to traditional single-pass models.
Findings
Significant performance improvements over single-pass baselines.
Competitive results with state-of-the-art two-pass methods.
Effective error filtering and semantic enhancement via CMA.
Abstract
The conventional encoder-decoder framework for image captioning generally adopts a single-pass decoding process, which predicts the target descriptive sentence word by word in temporal order. Despite the great success of this framework, it still suffers from two serious disadvantages. First, it cannot correct mistakes in the predicted words, which may mislead subsequent predictions and cause errors to accumulate. Second, such a framework can leverage only the already generated words, not possible future words, and thus lacks the ability to plan globally over linguistic information. To overcome these limitations, we explore a universal two-pass decoding framework, in which a single-pass decoding-based model serving as the Drafting Model first generates a draft caption from an input image, and a Deliberation Model then performs the polishing process…
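The control flow of the two-pass idea described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names (`draft_caption`, `deliberate`), the step signatures, and the hand-written toy rules standing in for the Drafting Model and Deliberation Model are all assumptions made for this example, and the CMA module itself is not modeled.

```python
from typing import Callable, List, Sequence

END = "<eos>"  # hypothetical end-of-sentence marker

# Hypothetical step signatures (assumptions, not taken from the paper).
DraftStep = Callable[[Sequence[float], Sequence[str]], str]
PolishStep = Callable[[Sequence[float], Sequence[str], Sequence[str], int], str]


def draft_caption(image_feats: Sequence[float], step: DraftStep,
                  max_len: int = 20) -> List[str]:
    """First pass: left-to-right decoding; each word sees only earlier words."""
    words: List[str] = []
    for _ in range(max_len):
        word = step(image_feats, words)
        if word == END:
            break
        words.append(word)
    return words


def deliberate(image_feats: Sequence[float], draft: Sequence[str],
               step: PolishStep) -> List[str]:
    """Second pass: re-predict each position with the whole draft visible,
    so later ("future") draft words can inform earlier positions."""
    polished: List[str] = []
    for i in range(len(draft)):
        polished.append(step(image_feats, draft, polished, i))
    return polished


# Toy stand-ins for the two models: fixed rules, not learned networks.
def toy_draft_step(image_feats: Sequence[float], prefix: Sequence[str]) -> str:
    canned = ["a", "dog", "riding", "a", "wave", END]  # "dog" is a planted error
    return canned[len(prefix)]


def toy_polish_step(image_feats: Sequence[float], draft: Sequence[str],
                    polished: Sequence[str], i: int) -> str:
    # Use a later draft word to fix an earlier one.
    if draft[i] == "dog" and "wave" in draft[i + 1:]:
        return "surfer"
    return draft[i]


image_feats = [0.0]  # placeholder for real image features
draft = draft_caption(image_feats, toy_draft_step)
final = deliberate(image_feats, draft, toy_polish_step)
print(" ".join(draft))  # a dog riding a wave
print(" ".join(final))  # a surfer riding a wave
```

The point of the sketch is only the difference in what each pass can see: the first pass conditions on the prefix, while the second conditions on the full draft and can therefore revise an early word using words that come after it.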
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization
