Learning to Collocate Visual-Linguistic Neural Modules for Image Captioning

Xu Yang; Hanwang Zhang; Chongyang Gao; Jianfei Cai

arXiv:2210.01338 · cs.CV · April 25, 2023

TL;DR

This paper introduces a modular neural network approach for image captioning that dynamically collocates visual and linguistic modules, achieving state-of-the-art results and improved robustness over existing methods.

Contribution

The paper proposes a novel visual-linguistic modular network with dynamic module collocation and a syntax-based regularization, and demonstrates superior performance on MS-COCO.

Findings

01

Achieved a new state-of-the-art CIDEr-D score of 129.5 on MS-COCO.

02

The model is less prone to overfitting and performs better with fewer training samples.

03

Demonstrated robustness and effectiveness of the modular design in image captioning.

Abstract

Humans tend to decompose a sentence into different parts like "sth do sth at someplace" and then fill each part with certain content. Inspired by this, we follow the principle of modular design to propose a novel image captioner: learning to Collocate Visual-Linguistic Neural Modules (CVLNM). Unlike the widely used neural module networks in VQA, where the language (i.e., the question) is fully observable, the task of collocating visual-linguistic modules is more challenging. This is because the language is only partially observable, so we need to dynamically collocate the modules during the process of image captioning. To sum up, we make the following technical contributions to design and train our CVLNM: 1) distinguishable module design: four modules in the encoder, including one linguistic module for function words and three visual modules…
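As a rough illustration of what "dynamically collocating" four modules could look like at a single decoding step, the sketch below fuses the module outputs with softmax weights. This is an assumption made for illustration, not the authors' implementation: the abstract above does not specify the fusion rule, the function names `softmax` and `collocate` are hypothetical, and the three visual modules are labelled generically because their roles are cut off in the abstract.

```python
import math

# Hypothetical labels: one linguistic module plus three visual modules.
# The abstract is truncated before it names the visual modules, so they
# are left generic here.
MODULE_NAMES = ["linguistic_function_words", "visual_1", "visual_2", "visual_3"]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def collocate(module_outputs, controller_scores):
    """Fuse module outputs as a softmax-weighted sum (assumed fusion rule).

    module_outputs: one feature vector (list of floats) per module.
    controller_scores: one unnormalised score per module, which a real
        model would compute from the partially generated caption.
    Returns the soft weights and the fused feature vector.
    """
    if len(module_outputs) != len(controller_scores):
        raise ValueError("need exactly one score per module")
    weights = softmax(controller_scores)
    dim = len(module_outputs[0])
    fused = [
        sum(w * out[i] for w, out in zip(weights, module_outputs))
        for i in range(dim)
    ]
    return weights, fused


# Toy 2-dimensional features, one per module in MODULE_NAMES order.
toy_outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
toy_weights, toy_fused = collocate(toy_outputs, [0.0, 0.0, 0.0, 0.0])
```

With equal scores every module contributes equally; raising one score shifts the fused feature toward that module's output. Because the caption is only partially observable during generation, such scores would have to be recomputed at each decoding step.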

Code & Models

Repositories

gcyzsl/cvlmn
PyTorch · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Natural Language Processing Techniques