Towards Local Visual Modeling for Image Captioning

Yiwei Ma; Jiayi Ji; Xiaoshuai Sun; Yiyi Zhou; Rongrong Ji

arXiv:2302.06098·cs.CV·February 14, 2023

TL;DR

This paper introduces LSTNet, a novel transformer-based model that enhances image captioning by focusing on local visual features through new attention and fusion mechanisms, leading to improved performance on benchmark datasets.

Contribution

The paper proposes LSTNet with Locality-Sensitive Attention and Fusion, advancing local visual modeling in image captioning beyond existing methods.

Findings

01. LSTNet outperforms state-of-the-art models on MS-COCO, reaching a CIDEr score of 134.8.

02. LSTNet generalizes well to the Flickr8k and Flickr30k datasets.

03. The model effectively captures local visual details, producing more accurate captions.

Abstract

In this paper, we study the local visual modeling with grid features for image captioning, which is critical for generating accurate and detailed captions. To achieve this target, we propose a Locality-Sensitive Transformer Network (LSTNet) with two novel designs, namely Locality-Sensitive Attention (LSA) and Locality-Sensitive Fusion (LSF). LSA is deployed for the intra-layer interaction in Transformer via modeling the relationship between each grid and its neighbors. It reduces the difficulty of local object recognition during captioning. LSF is used for inter-layer information fusion, which aggregates the information of different encoder layers for cross-layer semantical complementarity. With these two novel designs, the proposed LSTNet can model the local visual information of grid features to improve the captioning quality. To validate LSTNet, we conduct extensive experiments on…
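The idea of relating each grid cell to its neighbors can be illustrated with a small sketch. The code below is not the authors' LSA module; it swaps in a generic local-window self-attention, and the names `local_mask`, `local_attention`, and `radius` are hypothetical. It assumes grid features stored in row-major order and, for brevity, uses the features directly as queries, keys, and values with no learned projections.

```python
import math


def local_mask(height, width, radius=1):
    """mask[i][j] is True when cell j lies within `radius` of cell i.

    Distance is Chebyshev distance on the grid, so radius=1 gives a 3x3 window.
    """
    n = height * width
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        ri, ci = divmod(i, width)
        for j in range(n):
            rj, cj = divmod(j, width)
            mask[i][j] = max(abs(ri - rj), abs(ci - cj)) <= radius
    return mask


def local_attention(features, height, width, radius=1):
    """Scaled dot-product self-attention restricted to each cell's neighborhood.

    Assumption: `features` is a list of height*width vectors in row-major order.
    """
    dim = len(features[0])
    mask = local_mask(height, width, radius)
    output = []
    for i, query in enumerate(features):
        # Only neighboring cells (including the cell itself) take part.
        neighbors = [j for j in range(len(features)) if mask[i][j]]
        scores = [
            sum(a * b for a, b in zip(query, features[j])) / math.sqrt(dim)
            for j in neighbors
        ]
        # Numerically stable softmax over the neighborhood.
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        output.append([
            sum(w * features[j][d] for w, j in zip(weights, neighbors))
            for d in range(dim)
        ])
    return output
```

With `radius=1` on a 3x3 grid, the center cell attends to all nine cells while a corner cell attends to only four, which is the kind of locality constraint the abstract alludes to.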

Code & Models

Repositories

xmu-xiaoma666/lstnet (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Advanced Neural Network Applications

Methods: Attention Is All You Need · Linear Layer · Absolute Position Encodings · Dropout · Layer Normalization · Dense Connections · Multi-Head Attention · Position-Wise Feed-Forward Layer · Adam · Softmax