TL;DR
This paper introduces a Context-Aware Visual Policy network (CAVP) for image captioning that explicitly models visual context over time, with the aim of generating more accurate and contextually rich captions.
Contribution
It proposes a visual policy network that treats the previous visual attentions as context at each time step, extending sequence-level image captioning beyond attention mechanisms that ignore earlier attention steps.
Findings
Achieves state-of-the-art results on the MS-COCO dataset.
Effectively models complex visual compositions over time.
Improves caption quality by incorporating visual context.
Abstract
Many vision-language tasks can be reduced to the problem of sequence prediction for natural language output. In particular, recent advances in image captioning use deep reinforcement learning (RL) to alleviate the "exposure bias" during training: the ground-truth subsequence is exposed at every prediction step, which introduces bias at test time, when only the predicted subsequence is seen. However, existing RL-based image captioning methods focus only on the language policy and not on the visual policy (e.g., visual attention), and thus fail to capture the visual context that is crucial for compositional reasoning such as visual relationships (e.g., "man riding horse") and comparisons (e.g., "smaller cat"). To fill the gap, we propose a Context-Aware Visual Policy network (CAVP) for sequence-level image captioning. At every time step, CAVP explicitly accounts for the previous visual attentions as…
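The idea of conditioning each attention step on the previous visual attentions can be illustrated with a small sketch. This is not the CAVP architecture or its RL training procedure, neither of which is specified in the text above. It is a toy illustration under stated assumptions: the function names (`context_aware_attention`, `run_steps`), the mean-pooled history, the additive conditioning of the query, and the dot-product scoring are all hypothetical choices made for this example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def weighted_sum(weights, vectors):
    dim = len(vectors[0])
    return [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]


def context_aware_attention(regions, query, history):
    """One attention step conditioned on previously attended features.

    regions: list of region feature vectors
    query:   decoder state vector (same dimension as the regions)
    history: attended feature vectors from earlier steps (may be empty)
    """
    dim = len(query)
    if history:
        # Assumption: summarize visual context as the mean of past attended features.
        context = [sum(h[i] for h in history) / len(history) for i in range(dim)]
    else:
        context = [0.0] * dim
    # Assumption: condition the query on the context by simple addition.
    conditioned = [q + c for q, c in zip(query, context)]
    weights = softmax([dot(conditioned, r) for r in regions])
    attended = weighted_sum(weights, regions)
    return weights, attended


def run_steps(regions, queries):
    """Run attention over several time steps, feeding each result into the history."""
    history = []
    all_weights = []
    for query in queries:
        weights, attended = context_aware_attention(regions, query, history)
        history.append(attended)
        all_weights.append(weights)
    return all_weights, history


if __name__ == "__main__":
    regions = [[1.0, 0.0], [0.0, 1.0]]
    queries = [[1.0, 0.0], [0.0, 0.0]]
    all_weights, _ = run_steps(regions, queries)
    for step, weights in enumerate(all_weights):
        print(step, [round(w, 3) for w in weights])
```

In this toy setup the second query is uninformative on its own, so attention without history would be uniform; with the history from the first step, the weights lean toward the region attended earlier. That is the sense in which earlier attentions act as context here.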
