Prophet Attention: Predicting Attention with Future Attention for Image Captioning
Fenglin Liu, Xuancheng Ren, Xian Wu, Wei Fan, Yuexian Zou, Xu Sun

TL;DR
Prophet Attention introduces a future-aware attention mechanism for image captioning that uses future information during training to improve grounding accuracy and caption quality, achieving state-of-the-art results.
Contribution
The paper proposes Prophet Attention, a novel method that leverages future information to regularize attention weights, enhancing grounding and captioning performance in image captioning models.
Findings
Outperforms baselines on Flickr30k and MSCOCO datasets.
Achieves new state-of-the-art results on benchmark datasets.
Secures first place on MSCOCO leaderboard for CIDEr-c40.
Abstract
Recently, attention-based models have been used extensively in many sequence-to-sequence learning systems. For image captioning in particular, attention-based models are expected to ground the generated words in the correct image regions. However, at each time step of the decoding process, these models usually use the hidden state of the current input to attend to the image regions. Under this setting, they suffer from a "deviated focus" problem: the attention weights are calculated from the previous words instead of the word to be generated, which impairs both grounding and captioning performance. In this paper, we propose Prophet Attention, which takes a form similar to self-supervision. In the training stage, this module utilizes future information to calculate the "ideal" attention weights towards image regions. These calculated "ideal" weights are…
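The idea in the abstract can be illustrated with a minimal sketch: attention weights computed from the previous-word state are compared against "ideal" weights computed from the word to be generated, and the gap becomes a training-time penalty. Everything concrete here is an assumption made for illustration and is not taken from the paper excerpt above: the scaled dot-product scoring, the L1 distance used as the regularizer, the toy feature vectors, and all function and variable names.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention_weights(query, regions):
    """Scaled dot-product attention of one query over region features (assumed scoring)."""
    scale = math.sqrt(len(query))
    return softmax([dot(query, r) / scale for r in regions])

def attention_regularizer(current_weights, ideal_weights):
    """L1 distance between the model's weights and the 'ideal' weights (assumed penalty)."""
    return sum(abs(c - i) for c, i in zip(current_weights, ideal_weights))

# Hypothetical toy data: 3 image regions with 2-dimensional features.
regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
prev_word_state = [2.0, 0.0]    # what conventional attention conditions on
future_word_state = [0.0, 2.0]  # stands in for the word to be generated

current = attention_weights(prev_word_state, regions)
ideal = attention_weights(future_word_state, regions)  # available only during training
penalty = attention_regularizer(current, ideal)
```

In this toy setup the previous-word state favors region 0 while the future-word state favors region 1, so the penalty is positive; minimizing it during training would pull the conventional attention toward the regions relevant to the upcoming word. At inference time only `current` would be computed, since the future word is unknown.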
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
