Learning to Collocate Visual-Linguistic Neural Modules for Image Captioning
Xu Yang, Hanwang Zhang, Chongyang Gao, Jianfei Cai

TL;DR
This paper introduces a modular neural network approach for image captioning that dynamically collocates visual and linguistic modules, achieving state-of-the-art results and improved robustness over existing methods.
Contribution
The paper proposes a novel visual-linguistic modular network with dynamic module collocation and a syntax-based regularization, and demonstrates superior performance on MS-COCO.
Findings
Achieved a new state-of-the-art CIDEr-D score of 129.5 on MS-COCO.
The model is less prone to overfitting and holds up better when fewer training samples are available.
Demonstrated robustness and effectiveness of the modular design in image captioning.
Abstract
Humans tend to decompose a sentence into different parts like STH DO STH AT SOMEPLACE and then fill each part with certain content. Inspired by this, we follow the principle of modular design to propose a novel image captioner: learning to Collocate Visual-Linguistic Neural Modules (CVLNM). Unlike the widely used neural module networks in VQA, where the language (i.e., question) is fully observable, the task of collocating visual-linguistic modules is more challenging. This is because the language is only partially observable, for which we need to dynamically collocate the modules during the process of image captioning. To sum up, we make the following technical contributions to design and train our CVLNM: 1) distinguishable module design: four modules in the encoder, including one linguistic module for function words and three visual modules…
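To make the idea of dynamic collocation concrete, here is a minimal sketch, not the authors' implementation. It assumes a soft, attention-style fusion: at each decoding step a controller produces one score per module, and the module outputs are mixed with softmax weights. The names `softmax`, `collocate`, the placeholder module labels, and all numbers are hypothetical and chosen only for illustration.

```python
import math

# Placeholder labels: the abstract mentions one linguistic module for
# function words and three visual modules (labels here are hypothetical).
MODULE_NAMES = ["function_words", "visual_1", "visual_2", "visual_3"]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def collocate(module_outputs, controller_scores):
    """Fuse module outputs with soft weights (an assumed fusion scheme).

    module_outputs: one feature vector (list of floats) per module.
    controller_scores: one unnormalized score per module, which in a real
    captioner would depend on the partially generated caption.
    Returns the softmax weights and the weighted sum of the outputs.
    """
    weights = softmax(controller_scores)
    dim = len(module_outputs[0])
    fused = [
        sum(w * out[i] for w, out in zip(weights, module_outputs))
        for i in range(dim)
    ]
    return weights, fused


# Toy module outputs, one 2-d vector per module (made-up values).
outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]

# Equal scores: every module contributes equally to the fused feature.
uniform_weights, uniform_fused = collocate(outputs, [0.0, 0.0, 0.0, 0.0])

# A high score for the first module lets it dominate this decoding step.
skewed_weights, skewed_fused = collocate(outputs, [5.0, 0.0, 0.0, 0.0])
```

Because the caption is only partially observable during generation, the scores would be recomputed at every step, so the mixture of modules can change from word to word.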
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Natural Language Processing Techniques
