Efficient Modeling of Future Context for Image Captioning
Zhengcong Fei, Junshi Huang, Xiaoming Wei, Xiaolin Wei

TL;DR
This paper introduces a novel method to incorporate future context into autoregressive image captioning models by leveraging ideas from non-autoregressive models, resulting in improved captioning performance without additional inference cost.
Contribution
It proposes a training framework that enables autoregressive models to utilize future context effectively, combining shared visual encoders and a teacher-student paradigm for enhanced captioning.
Findings
Outperforms state-of-the-art baselines on MS COCO
Improves automatic metrics and human evaluation scores
Maintains inference efficiency without extra time cost
Abstract
Existing approaches to image captioning usually generate the sentence word by word from left to right, conditioned only on local context: the given image and the previously generated words. Many studies have aimed to make use of global information during decoding, e.g., through iterative refinement. However, how to incorporate future context effectively and efficiently remains under-explored. To address this issue, and inspired by the observation that Non-Autoregressive Image Captioning (NAIC) can leverage two-sided relations through a modified mask operation, we aim to graft this advance onto the conventional Autoregressive Image Captioning (AIC) model while maintaining inference efficiency, with no extra time cost. Specifically, AIC and NAIC models are first trained jointly with shared visual encoders, forcing the visual encoder to contain sufficient and valid future…
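The difference in available context between the two decoder types can be illustrated through their self-attention masks. The following is a minimal sketch, not the authors' implementation: it assumes a standard Transformer-style decoder in which the AIC model uses a causal (lower-triangular) mask and the NAIC model uses a fully bidirectional one. The abstract only says the mask operation is "modified", so the exact NAIC mask is an assumption, and the function names `causal_mask`, `bidirectional_mask`, and `masked_softmax` are hypothetical.

```python
import math


def causal_mask(n):
    # AIC-style mask: position i may attend only to positions j <= i (history).
    return [[j <= i for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    # Assumed NAIC-style mask: every position may attend to every other one,
    # so both past and future positions contribute.
    return [[True] * n for _ in range(n)]


def masked_softmax(scores, mask):
    # Row-wise softmax over attention scores; masked-out entries get weight 0.
    weights = []
    for row, allowed in zip(scores, mask):
        kept = [s for s, ok in zip(row, allowed) if ok]
        top = max(kept)  # subtract the row maximum for numerical stability
        exps = [math.exp(s - top) if ok else 0.0 for s, ok in zip(row, allowed)]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


n = 4
scores = [[0.0] * n for _ in range(n)]  # uniform scores, for illustration only

aic_weights = masked_softmax(scores, causal_mask(n))
naic_weights = masked_softmax(scores, bidirectional_mask(n))
```

With uniform scores, the first row of `aic_weights` places all weight on the first position, while every row of `naic_weights` spreads weight evenly over all four positions, including those to the right. This two-sided access is what the causal decoder lacks at inference time.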
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
