Dual-Level Collaborative Transformer for Image Captioning
Yunpeng Luo, Jiayi Ji, Xiaoshuai Sun, Liujuan Cao, Yongjian Wu, Feiyue Huang, Chia-Wen Lin, Rongrong Ji

TL;DR
This paper introduces a Dual-Level Collaborative Transformer that combines region and grid features for image captioning, using novel attention mechanisms and geometric alignment to improve contextual understanding and achieve state-of-the-art results.
Contribution
The paper proposes a novel DLCT network with dual-way self-attention and geometric alignment for better feature fusion in image captioning.
Findings
Achieves a new state-of-the-art CIDEr-D score of 133.8% on the Karpathy split.
Demonstrates effective fusion of region and grid features.
Outperforms existing methods on the MS-COCO dataset.
Abstract
Descriptive region features extracted by object detection networks have played an important role in the recent advancements of image captioning. However, they are still criticized for the lack of contextual information and fine-grained details, which in contrast are the merits of traditional grid features. In this paper, we introduce a novel Dual-Level Collaborative Transformer (DLCT) network to realize the complementary advantages of the two features. Concretely, in DLCT, these two features are first processed by a novel Dual-Way Self Attention (DWSA) to mine their intrinsic properties, where a Comprehensive Relation Attention component is also introduced to embed the geometric information. In addition, we propose a Locality-Constrained Cross Attention module to address the semantic noises caused by the direct fusion of these two features, where a geometric alignment graph is constructed…
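The abstract is cut off before it says how the geometric alignment graph is built, so the following is only a rough sketch of the general idea and not the authors' implementation. It assumes that a region is linked to a grid cell whenever their boxes overlap, and that cross attention is then restricted to linked pairs. The helper names (`grid_boxes`, `alignment_mask`, `masked_cross_attention`), the unit-square coordinates, and the overlap rule are all illustrative assumptions.

```python
import numpy as np


def grid_boxes(n):
    """Boxes [x1, y1, x2, y2] for an n x n grid over the unit square, row-major."""
    step = 1.0 / n
    ys, xs = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    x1 = xs.ravel() * step
    y1 = ys.ravel() * step
    return np.stack([x1, y1, x1 + step, y1 + step], axis=1)


def alignment_mask(region_boxes, cell_boxes):
    """Boolean (num_regions, num_cells) matrix.

    ASSUMPTION: a region and a grid cell are aligned when their boxes
    overlap with positive area. The source does not state the rule.
    """
    r = region_boxes[:, None, :]  # (R, 1, 4)
    c = cell_boxes[None, :, :]    # (1, G, 4)
    w = np.minimum(r[..., 2], c[..., 2]) - np.maximum(r[..., 0], c[..., 0])
    h = np.minimum(r[..., 3], c[..., 3]) - np.maximum(r[..., 1], c[..., 1])
    return (w > 0) & (h > 0)


def masked_cross_attention(queries, keys, values, mask):
    """Scaled dot-product attention restricted to pairs allowed by `mask`."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)
    # A large negative score drives disallowed weights to zero after softmax.
    scores = np.where(mask, scores, -1e9)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ values, weights


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Two hypothetical region boxes in normalized image coordinates.
    regions = np.array([[0.0, 0.0, 0.4, 0.4], [0.3, 0.3, 0.7, 0.7]])
    cells = grid_boxes(2)
    mask = alignment_mask(regions, cells)
    region_feats = rng.normal(size=(2, 8))
    grid_feats = rng.normal(size=(4, 8))
    fused, weights = masked_cross_attention(region_feats, grid_feats, grid_feats, mask)
    print(mask)
    print(weights.round(3))
```

In this toy setup the first region lies entirely inside the top-left cell, so it attends only to that cell, while the second region straddles all four cells and attends to each of them. Restricting attention this way is one plausible reading of "locality-constrained"; the paper itself should be consulted for the actual construction.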
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Label Smoothing · Attention Is All You Need · Byte Pair Encoding · Multi-Head Attention · Softmax · Layer Normalization · Dropout
