ACORT: A Compact Object Relation Transformer for Parameter Efficient Image Captioning
Jia Huei Tan, Ying Hua Tan, Chee Seng Chan, Joon Huang Chuah

TL;DR
ACORT introduces three parameter reduction techniques for Transformer-based image captioning, yielding models with 3.7x to 21.6x fewer parameters than the baseline while remaining competitive on MS-COCO (CIDEr >=126).
Contribution
The paper proposes a novel combination of three parameter reduction methods for Transformer models in image captioning: Radix Encoding, cross-layer parameter sharing, and attention parameter sharing.
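As a rough illustration of why cross-layer sharing reduces model size, the sketch below builds a stack of layers either with independent weights or with one set of weights reused at every depth. `TinyLayer`, `build_stack`, and `count_unique_params` are hypothetical names, and the single weight matrix per layer is a simplifying assumption; this is not the authors' implementation.

```python
import random


class TinyLayer:
    """Hypothetical stand-in for a Transformer layer: one d x d weight matrix."""

    def __init__(self, d, seed):
        rng = random.Random(seed)
        self.w = [[rng.uniform(-0.1, 0.1) for _ in range(d)] for _ in range(d)]

    def __call__(self, x):
        # Residual connection plus a linear map: y = x + W x
        return [xi + sum(wij * xj for wij, xj in zip(row, x))
                for xi, row in zip(x, self.w)]

    def num_params(self):
        return len(self.w) * len(self.w[0])


def build_stack(num_layers, d, share):
    """Return a list of layers; with share=True every entry is the same object."""
    if share:
        layer = TinyLayer(d, seed=0)
        return [layer] * num_layers
    return [TinyLayer(d, seed=i) for i in range(num_layers)]


def count_unique_params(stack):
    """Count parameters once per distinct layer object."""
    unique = {id(layer): layer for layer in stack}
    return sum(layer.num_params() for layer in unique.values())


def forward(stack, x):
    """Apply the layers in order; depth is unchanged by sharing."""
    for layer in stack:
        x = layer(x)
    return x


unshared = build_stack(num_layers=6, d=8, share=False)
shared = build_stack(num_layers=6, d=8, share=True)
print(count_unique_params(unshared), count_unique_params(shared))
```

In this toy setting the shared stack still applies six layers of computation but stores the weights of only one, which is the general trade-off that cross-layer sharing makes: the same depth with fewer stored parameters.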
Findings
ACORT models have 3.7x to 21.6x fewer parameters than the baseline model.
They achieve CIDEr scores >=126 on MS-COCO.
They remain competitive with baselines and SOTA approaches despite the parameter reduction.
Abstract
Recent research that applies Transformer-based architectures to image captioning has resulted in state-of-the-art image captioning performance, capitalising on the success of Transformers on natural language tasks. Unfortunately, though these models work well, one major flaw is their large model sizes. To this end, we present three parameter reduction methods for image captioning Transformers: Radix Encoding, cross-layer parameter sharing, and attention parameter sharing. By combining these methods, our proposed ACORT models have 3.7x to 21.6x fewer parameters than the baseline model without compromising test performance. Results on the MS-COCO dataset demonstrate that our ACORT models are competitive against baselines and SOTA approaches, with CIDEr score >=126. Finally, we present qualitative results and ablation studies to demonstrate the efficacy of the proposed changes further.…
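The abstract names Radix Encoding without describing it. A minimal sketch follows, assuming the term means writing each token index as a fixed number of digits in some base, so that the embedding table needs one row per digit value rather than one row per vocabulary word. The function names, the base of 100, and the vocabulary size of 10,000 are illustrative assumptions, not values from the paper.

```python
def radix_encode(token_id, base, num_digits):
    """Write token_id as num_digits digits in the given base, most significant first."""
    if not 0 <= token_id < base ** num_digits:
        raise ValueError("token_id does not fit in the requested number of digits")
    digits = []
    for _ in range(num_digits):
        token_id, remainder = divmod(token_id, base)
        digits.append(remainder)
    return digits[::-1]


def radix_decode(digits, base):
    """Inverse of radix_encode: rebuild the original token index."""
    value = 0
    for digit in digits:
        value = value * base + digit
    return value


def embedding_params(num_rows, dim):
    """Parameter count of an embedding table with num_rows rows of width dim."""
    return num_rows * dim


# Illustrative numbers only: a 10,000-word vocabulary, base 100, 2 digits per word.
vocab_size, base, num_digits, dim = 10_000, 100, 2, 512
print(radix_encode(4321, base, num_digits))   # [43, 21]
print(radix_decode([43, 21], base))           # 4321
print(embedding_params(vocab_size, dim))      # 5120000 rows*dim for the full table
print(embedding_params(base, dim))            # 51200 rows*dim for the digit table
```

Under this assumption the cost is a longer output sequence, since each word becomes `num_digits` tokens, in exchange for a much smaller embedding table.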
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning
