Dual-Level Collaborative Transformer for Image Captioning

Yunpeng Luo; Jiayi Ji; Xiaoshuai Sun; Liujuan Cao; Yongjian Wu; Feiyue Huang; Chia-Wen Lin; Rongrong Ji

arXiv:2101.06462 · cs.CV · August 4, 2021 · 24 cites

TL;DR

This paper introduces a Dual-Level Collaborative Transformer that combines region and grid features for image captioning, using novel attention mechanisms and geometric alignment to improve contextual understanding and achieve state-of-the-art results.

Contribution

The paper proposes a novel DLCT network with dual-way self-attention and geometric alignment for better feature fusion in image captioning.

Findings

01

Achieves a new state-of-the-art CIDEr-D score of 133.8% on the Karpathy split.

02

Demonstrates effective fusion of region and grid features.

03

Outperforms existing methods on the MS-COCO dataset.

Abstract

Descriptive region features extracted by object detection networks have played an important role in the recent advancements of image captioning. However, they are still criticized for the lack of contextual information and fine-grained details, which in contrast are the merits of traditional grid features. In this paper, we introduce a novel Dual-Level Collaborative Transformer (DLCT) network to realize the complementary advantages of the two features. Concretely, in DLCT, these two features are first processed by a novel Dual-Way Self Attention (DWSA) to mine their intrinsic properties, where a Comprehensive Relation Attention component is also introduced to embed the geometric information. In addition, we propose a Locality-Constrained Cross Attention module to address the semantic noise caused by the direct fusion of these two features, where a geometric alignment graph is constructed…
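The idea of restricting cross attention with a geometric alignment graph can be illustrated with a small sketch. This is not the authors' implementation: the abstract is truncated before it describes how the graph is built, so the rule used here (a region is linked to a grid cell when their boxes overlap) is an assumption, and the names `grid_boxes`, `alignment_graph`, and `masked_attention` are hypothetical. The sketch uses plain Python and a single attention head for readability.

```python
import math

def grid_boxes(rows, cols):
    """Normalized (x1, y1, x2, y2) boxes for a rows x cols grid, row-major."""
    return [(c / cols, r / rows, (c + 1) / cols, (r + 1) / rows)
            for r in range(rows) for c in range(cols)]

def overlaps(a, b):
    """True if two boxes share an area of positive size."""
    return (min(a[2], b[2]) > max(a[0], b[0])
            and min(a[3], b[3]) > max(a[1], b[1]))

def alignment_graph(region_boxes, cells):
    """graph[i][j] is True when region i overlaps grid cell j (assumed rule)."""
    return [[overlaps(rb, cb) for cb in cells] for rb in region_boxes]

def masked_attention(query, keys, values, mask):
    """Scaled dot-product attention for one query over keys where mask is True."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale if m else None
              for key, m in zip(keys, mask)]
    allowed = [s for s in scores if s is not None]
    if not allowed:
        # No aligned key: return a zero vector and zero weights.
        return [0.0] * len(values[0]), [0.0] * len(keys)
    top = max(allowed)  # subtract the max for numerical stability
    exps = [math.exp(s - top) if s is not None else 0.0 for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    out = [sum(w * v[i] for w, v in zip(weights, values))
           for i in range(len(values[0]))]
    return out, weights

# Toy example: a 2x2 grid and two detected regions (made-up boxes and features).
cells = grid_boxes(2, 2)
regions = [(0.0, 0.0, 0.4, 0.4), (0.3, 0.3, 0.9, 0.9)]
graph = alignment_graph(regions, cells)

grid_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
region_query = [1.0, 0.0]

# Region 0 touches only the top-left cell, so it attends to that cell alone.
out0, w0 = masked_attention(region_query, grid_feats, grid_feats, graph[0])
# Region 1 straddles all four cells, so its attention is spread over all of them.
out1, w1 = masked_attention(region_query, grid_feats, grid_feats, graph[1])
```

Masking before the softmax means unaligned grid cells receive exactly zero weight, which is one simple way to keep spatially unrelated features from mixing during fusion.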

Code & Models

Repositories

luo3300612/image-captioning-DLCT
PyTorch · Official

Videos

Dual-Level Collaborative Transformer for Image Captioning · Underline

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques

Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Label Smoothing · Attention Is All You Need · Byte Pair Encoding · Multi-Head Attention · Softmax · Layer Normalization · Dropout