Uncertainty-Aware Image Captioning
Zhengcong Fei, Mingyuan Fan, Li Zhu, Junshi Huang, Xiaoming Wei, Xiaolin Wei

TL;DR
This paper introduces an uncertainty-aware image captioning framework that iteratively inserts words in order of increasing uncertainty (easy words first), improving caption quality and decoding speed through a non-autoregressive hierarchy.
Contribution
It proposes a novel parallel and iterative captioning method that accounts for word uncertainty, enhancing explainability and efficiency over traditional sequential models.
Findings
Outperforms baseline in caption quality on MS COCO
Achieves faster decoding with logarithmic time complexity
Provides more explainable caption generation process
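To see why inserting words in parallel can give logarithmic rather than linear decoding time, consider a simplified, hypothetical schedule in which every round inserts exactly one word into each open gap, always the middle word of that gap. This ordering is an assumption for illustration only: the paper orders insertions by word uncertainty, not by position. Under this midpoint schedule, a caption of n words is complete after ceil(log2(n + 1)) rounds, compared with n steps for left-to-right autoregressive decoding. The function name `balanced_insertion_rounds` is hypothetical.

```python
import math


def balanced_insertion_rounds(words):
    """Simulate parallel insertion decoding with a hypothetical midpoint order.

    In each round, every open gap of not-yet-generated positions receives
    its middle word. This is NOT the paper's uncertainty-based order; it
    only illustrates the round count of parallel insertion.
    Returns the partial caption visible after each round.
    """
    n = len(words)
    placed = [False] * n
    # Each gap is a half-open span [lo, hi) of positions not yet filled.
    gaps = [(0, n)] if n > 0 else []
    history = []
    while gaps:
        next_gaps = []
        for lo, hi in gaps:
            mid = (lo + hi) // 2
            placed[mid] = True
            # The inserted word splits its gap into at most two smaller gaps.
            if lo < mid:
                next_gaps.append((lo, mid))
            if mid + 1 < hi:
                next_gaps.append((mid + 1, hi))
        gaps = next_gaps
        history.append(" ".join(w for w, p in zip(words, placed) if p))
    return history


caption = "a man riding a wave on top of a surfboard".split()
rounds = balanced_insertion_rounds(caption)
for i, partial in enumerate(rounds, 1):
    print(i, partial)
print("rounds:", len(rounds), "bound:", math.ceil(math.log2(len(caption) + 1)))
```

Each round at least halves the largest remaining gap, which is where the logarithmic bound comes from; an order chosen by uncertainty would need to split gaps roughly evenly to keep a similar bound.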
Abstract
It is widely believed that the more uncertain a word in a caption is, the more inter-correlated context information is required to determine it. However, current image captioning methods usually generate all words in a sentence sequentially and treat them equally. In this paper, we propose an uncertainty-aware image captioning framework that iteratively inserts discontinuous candidate words between existing words in parallel, from easy to difficult, until convergence. We hypothesize that high-uncertainty words in a sentence need more prior information to be decided correctly and should therefore be produced at a later stage. The resulting non-autoregressive hierarchy makes caption generation explainable and intuitive. Specifically, we utilize an image-conditioned bag-of-words model to measure word uncertainty and apply a dynamic programming algorithm to…
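The abstract's two ingredients, scoring word uncertainty and deciding which words are inserted at which stage, can be sketched as follows. Everything here is an assumption for illustration: `bow_probs` stands in for the output of an image-conditioned bag-of-words model, uncertainty is taken as the negative log-probability -log p(w | I), and the hypothetical `greedy_insertion_stages` uses a simple greedy rule in place of the paper's dynamic programming algorithm, whose details are not given here. The probabilities in the usage part are made up.

```python
import math


def word_uncertainty(words, bow_probs, floor=1e-6):
    """Uncertainty of each word as -log p(word | image).

    bow_probs is a hypothetical stand-in for an image-conditioned
    bag-of-words model: a dict mapping words to probabilities.
    Words missing from it get a small floor probability (an assumption).
    """
    return [-math.log(max(bow_probs.get(w, 0.0), floor)) for w in words]


def greedy_insertion_stages(words, uncertainty):
    """Split a caption into insertion stages, easy (low-uncertainty) words first.

    In each stage, every gap of not-yet-placed positions contributes its
    single lowest-uncertainty word, so words are inserted in parallel
    between existing ones. This greedy rule is an illustrative stand-in,
    not the paper's dynamic-programming procedure.
    """
    n = len(words)
    gaps = [(0, n)] if n > 0 else []
    stages = []
    while gaps:
        chosen, next_gaps = [], []
        for lo, hi in gaps:
            # Lowest-uncertainty position inside this gap (ties -> leftmost).
            k = min(range(lo, hi), key=lambda i: uncertainty[i])
            chosen.append(k)
            if lo < k:
                next_gaps.append((lo, k))
            if k + 1 < hi:
                next_gaps.append((k + 1, hi))
        stages.append(sorted(chosen))
        gaps = next_gaps
    return stages


caption = "a dog catching a frisbee in the park".split()
# Made-up bag-of-words probabilities, purely for illustration.
bow = {"dog": 0.9, "frisbee": 0.8, "park": 0.6, "catching": 0.3,
       "a": 0.5, "in": 0.4, "the": 0.5}
u = word_uncertainty(caption, bow)
for s, idx in enumerate(greedy_insertion_stages(caption, u), 1):
    print(s, [caption[i] for i in idx])
```

A purely greedy rule like this one does not guarantee a logarithmic number of stages (the toy caption takes five stages for eight words), which suggests why a more global assignment procedure could be preferable.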
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Video Analysis and Summarization
