TL;DR
This paper introduces a compact bidirectional transformer for image captioning that leverages both past and future context, achieving state-of-the-art results on MSCOCO without vision-language pretraining.
Contribution
It proposes an architecture that couples left-to-right (L2R) and right-to-left (R2L) flows into a single compact model, enabling parallel decoding and improved captioning performance.
Findings
The bidirectional architecture outperforms unidirectional models on MSCOCO.
Sentence-level ensemble significantly enhances caption quality.
The model achieves new state-of-the-art results without vision-language pretraining.
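The sentence-level ensemble mentioned above can be illustrated with a minimal sketch. It assumes each flow (L2R and R2L) yields one candidate caption with per-token log-probabilities, and that the final caption is the candidate with the higher length-normalized score. The `Candidate` type, the scoring rule, and the example numbers are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    tokens: List[str]            # tokens in the order the decoder produced them
    token_logprobs: List[float]  # one log-probability per generated token
    direction: str               # "l2r" or "r2l"


def reading_order(c: Candidate) -> List[str]:
    # An R2L flow emits the last word first, so flip it back for display.
    return c.tokens if c.direction == "l2r" else c.tokens[::-1]


def score(c: Candidate) -> float:
    # Assumption: length-normalized log-probability as the sentence score.
    return sum(c.token_logprobs) / max(len(c.token_logprobs), 1)


def sentence_level_ensemble(candidates: List[Candidate]) -> str:
    # Keep the whole sentence from whichever flow scores higher.
    best = max(candidates, key=score)
    return " ".join(reading_order(best))


# Made-up candidates, for illustration only.
l2r = Candidate(["a", "dog", "runs"], [-0.2, -0.5, -0.9], "l2r")
r2l = Candidate(["grass", "on", "runs", "dog", "a"],
                [-0.1, -0.2, -0.3, -0.2, -0.2], "r2l")
caption = sentence_level_ensemble([l2r, r2l])
```

Selecting at the sentence level keeps each caption internally consistent, since words from the two flows are never mixed within one output.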
Abstract
Most current image captioning models generate captions from left to right. This unidirectional property means they can leverage only past context, not future context. Refinement-based models can exploit both past and future context by generating a new caption in a second stage based on captions pre-retrieved or pre-generated in a first stage, but their decoder generally consists of two networks (i.e., a retriever or captioner in the first stage and a captioner in the second stage) that can only be executed sequentially. In this paper, we introduce a Compact Bidirectional Transformer model for image captioning that can leverage bidirectional context implicitly and explicitly while the decoder can be executed in parallel. Specifically, it is implemented by tightly coupling left-to-right (L2R) and right-to-left (R2L) flows into a single compact model to…
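One way to picture coupling the L2R and R2L flows in a single decoder is to place both directions of each caption in the same batch, so that one forward pass handles them together. The sketch below only prepares such a batch. The special tokens `<l2r>`, `<r2l>`, and `<eos>`, and the helper names, are assumptions for illustration; the abstract does not specify how the paper encodes direction.

```python
from typing import List, Tuple

# Hypothetical special tokens marking the decoding direction.
L2R, R2L, EOS = "<l2r>", "<r2l>", "<eos>"


def make_bidirectional_pair(caption: str) -> Tuple[List[str], List[str]]:
    """Return the L2R and R2L target sequences for one caption."""
    words = caption.split()
    l2r = [L2R] + words + [EOS]
    r2l = [R2L] + words[::-1] + [EOS]  # same words, reversed order
    return l2r, r2l


def make_batch(captions: List[str]) -> List[List[str]]:
    """Interleave both directions so one decoder pass covers both flows."""
    batch: List[List[str]] = []
    for caption in captions:
        batch.extend(make_bidirectional_pair(caption))
    return batch


def causal_mask(n: int) -> List[List[bool]]:
    """Position i may attend to positions 0..i within its own flow."""
    return [[j <= i for j in range(n)] for i in range(n)]


batch = make_batch(["a dog runs"])
mask = causal_mask(len(batch[0]))
```

Because both sequences share one set of decoder weights and sit in the same batch, neither direction has to wait for the other, unlike a two-stage retrieve-then-refine pipeline.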
