Towards Local Visual Modeling for Image Captioning
Yiwei Ma, Jiayi Ji, Xiaoshuai Sun, Yiyi Zhou, Rongrong Ji

TL;DR
This paper introduces LSTNet, a novel transformer-based model that enhances image captioning by focusing on local visual features through new attention and fusion mechanisms, leading to improved performance on benchmark datasets.
Contribution
The paper proposes LSTNet with Locality-Sensitive Attention and Fusion, advancing local visual modeling in image captioning beyond existing methods.
Findings
LSTNet outperforms state-of-the-art models on MS-COCO, reaching a CIDEr score of 134.8.
LSTNet demonstrates strong generalization on Flickr8k and Flickr30k datasets.
The model effectively captures local visual details for more accurate captions.
Abstract
In this paper, we study local visual modeling with grid features for image captioning, which is critical for generating accurate and detailed captions. To this end, we propose a Locality-Sensitive Transformer Network (LSTNet) with two novel designs, namely Locality-Sensitive Attention (LSA) and Locality-Sensitive Fusion (LSF). LSA is deployed for intra-layer interaction in the Transformer by modeling the relationship between each grid and its neighbors, which reduces the difficulty of local object recognition during captioning. LSF is used for inter-layer information fusion: it aggregates the information of different encoder layers so that their semantics complement one another. With these two designs, the proposed LSTNet can model the local visual information of grid features to improve captioning quality. To validate LSTNet, we conduct extensive experiments on…
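The abstract names two ideas: attention that relates each grid cell to its neighbors, and fusion across encoder layers. The truncated text does not give the actual LSA or LSF formulations, so the following is only a minimal sketch of the general idea under stated assumptions, not the authors' method. It swaps in a plain neighborhood mask over single-head dot-product attention (no learned projections) and a simple average over layer outputs. The names `neighbor_mask`, `local_attention`, and `fuse_layers`, and the `radius` parameter, are hypothetical.

```python
import math


def neighbor_mask(height, width, radius=1):
    """mask[i][j] is True when grid cell j is within `radius` rows and
    columns of grid cell i (cells are flattened in row-major order)."""
    cells = [(r, c) for r in range(height) for c in range(width)]
    return [
        [max(abs(r1 - r2), abs(c1 - c2)) <= radius for (r2, c2) in cells]
        for (r1, c1) in cells
    ]


def local_attention(features, mask):
    """Single-head scaled dot-product self-attention in which each cell
    attends only to the cells its mask row allows.

    Assumption: queries, keys, and values are the raw features
    (no learned projections), which keeps the sketch short.
    """
    dim = len(features[0])
    outputs = []
    for i, query in enumerate(features):
        allowed = [j for j, ok in enumerate(mask[i]) if ok]
        scores = [
            sum(q * k for q, k in zip(query, features[j])) / math.sqrt(dim)
            for j in allowed
        ]
        # Numerically stable softmax over the allowed neighbors only.
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        outputs.append([
            sum(w * features[j][k] for w, j in zip(weights, allowed))
            for k in range(dim)
        ])
    return outputs


def fuse_layers(layer_outputs):
    """Average the outputs of several encoder layers, cell by cell.

    Assumption: plain averaging stands in for whatever learned fusion
    the paper uses.
    """
    n_layers = len(layer_outputs)
    n_cells = len(layer_outputs[0])
    dim = len(layer_outputs[0][0])
    return [
        [sum(layer[i][k] for layer in layer_outputs) / n_layers for k in range(dim)]
        for i in range(n_cells)
    ]


if __name__ == "__main__":
    mask = neighbor_mask(3, 3)
    grid = [[float(i), 1.0] for i in range(9)]  # 3x3 grid, 2-dim features
    layer1 = local_attention(grid, mask)
    layer2 = local_attention(layer1, mask)
    fused = fuse_layers([layer1, layer2])
    print(len(fused), len(fused[0]))
```

With `radius=1` on a 3x3 grid, a corner cell attends to 4 cells and the center cell to all 9. Stacking two masked layers widens the effective receptive field, which is one reason combining shallow and deep layer outputs can be useful.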
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Advanced Neural Network Applications
Methods: Attention Is All You Need · Linear Layer · Absolute Position Encodings · Dropout · Layer Normalization · Dense Connections · Multi-Head Attention · Position-Wise Feed-Forward Layer · Adam · Softmax
