Image Captioning via Compact Bidirectional Architecture

Zijie Song; Yuanen Zhou; Zhenzhen Hu; Daqing Liu; Huixia Ben; Richang Hong; Meng Wang

arXiv:2201.01984·cs.CV·April 8, 2026

TL;DR

This paper introduces a compact bidirectional transformer for image captioning that leverages both past and future context in a single model, achieving state-of-the-art results on MSCOCO without vision-language pretraining.

Contribution

It proposes a novel architecture that couples bidirectional flows into a single model, enabling parallel decoding and improved captioning performance.

Findings

01

The bidirectional architecture outperforms unidirectional models on MSCOCO.

02

Sentence-level ensemble significantly enhances caption quality.

03

The model achieves new state-of-the-art results without vision-language pretraining.

Abstract

Most current image captioning models generate captions from left to right. This unidirectional property means they can leverage only past context, not future context. Refinement-based models can exploit both past and future context by generating a new caption in a second stage, based on captions pre-retrieved or pre-generated in a first stage. However, the decoder of these models generally consists of two networks (i.e., a retriever or captioner in the first stage and a captioner in the second stage), which can only be executed sequentially. In this paper, we introduce a Compact Bidirectional Transformer model for image captioning that can leverage bidirectional context implicitly and explicitly while its decoder can be executed in parallel. Specifically, it is implemented by tightly coupling left-to-right (L2R) and right-to-left (R2L) flows into a single compact model to…
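
As a rough illustration of how captions from the two flows might be combined at the sentence level (the "sentence-level ensemble" mentioned in the findings), the sketch below picks whichever of an L2R and an R2L candidate has the higher total log-probability and returns it in reading order. The `Candidate` class, the function names, and the sum-of-log-probabilities selection rule are assumptions made for illustration; they are not taken from the paper or from the YuanEZhou/cbtic repository.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    """Hypothetical container for one decoded caption."""
    tokens: List[str]        # tokens in the order they were generated
    log_probs: List[float]   # per-token log-probabilities from the decoder
    direction: str           # "l2r" or "r2l"


def sentence_score(candidate: Candidate) -> float:
    # Assumption: score a caption by the sum of its token log-probabilities.
    return sum(candidate.log_probs)


def to_reading_order(candidate: Candidate) -> List[str]:
    # An R2L caption is generated last word first, so reverse it for reading.
    if candidate.direction == "r2l":
        return list(reversed(candidate.tokens))
    return list(candidate.tokens)


def sentence_level_ensemble(l2r: Candidate, r2l: Candidate) -> List[str]:
    # Keep the whole sentence from whichever flow scored higher.
    best = max((l2r, r2l), key=sentence_score)
    return to_reading_order(best)


l2r = Candidate(["a", "dog", "runs"], [-0.1, -0.5, -0.9], "l2r")
r2l = Candidate(["grass", "on", "runs", "dog", "a"], [-0.2] * 5, "r2l")
caption = sentence_level_ensemble(l2r, r2l)
print(" ".join(caption))
```

Selecting a whole sentence, rather than mixing tokens from both flows, keeps each caption internally consistent; a length-normalized score would be a reasonable alternative when the two candidates differ a lot in length.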

Code & Models

Repositories

YuanEZhou/cbtic
github
