TL;DR
This paper introduces LaConv, a dynamic, language-dependent convolution module, and LaConvNet, a fully language-driven network that unifies visual recognition and multi-modal reasoning, demonstrating superior performance on vision-and-language tasks.
Contribution
The paper proposes LaConv, a novel dynamic convolution module guided by natural language, and LaConvNet, the first fully language-driven convolutional network for multi-modal visual recognition.
Findings
In experiments on VQA and REC benchmarks, LaConv outperforms existing multi-modal modules.
LaConvNet generalizes well and delivers performance gains.
LaConvNet shows +4.7% improvement on RefCOCO+.
Abstract
In this paper, we are committed to establishing a unified, end-to-end multi-modal network by exploring language-guided visual recognition. To this end, we first propose a novel multi-modal convolution module called Language-dependent Convolution (LaConv). Its convolution kernels are dynamically generated from natural language information, which helps extract differentiated visual features for different multi-modal examples. Based on the LaConv module, we further build the first fully language-driven convolution network, termed LaConvNet, which unifies visual recognition and multi-modal reasoning in one forward structure. To validate LaConv and LaConvNet, we conduct extensive experiments on four benchmark datasets of two vision-and-language tasks, i.e., visual question answering (VQA) and referring expression comprehension (REC). The experimental…
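As a rough illustration of the core idea in the abstract (convolution kernels generated from language features rather than fixed after training), the sketch below may help. It is not the authors' implementation: the linear projection `proj`, the helper names, and the single per-example 3x3 kernel are all assumptions made for illustration, and the text vectors are toy stand-ins for real language features.

```python
def generate_kernel(text_vec, proj, k=3):
    """Map a language feature vector to a k x k kernel.

    `proj` is a hypothetical (k*k) x len(text_vec) weight matrix; a real
    module would learn it and would likely be more elaborate.
    """
    flat = [sum(w * t for w, t in zip(row, text_vec)) for row in proj]
    return [flat[i * k:(i + 1) * k] for i in range(k)]


def conv2d_valid(image, kernel):
    """Plain 2D cross-correlation, stride 1, no padding."""
    k = len(kernel)
    h, w = len(image), len(image[0])
    out = []
    for i in range(h - k + 1):
        row = []
        for j in range(w - k + 1):
            acc = 0.0
            for di in range(k):
                for dj in range(k):
                    acc += image[i + di][j + dj] * kernel[di][dj]
            row.append(acc)
        out.append(row)
    return out


def language_conditioned_conv(image, text_vec, proj, k=3):
    """Filter the image with a kernel that depends on the text input."""
    return conv2d_valid(image, generate_kernel(text_vec, proj, k))


# Toy 4x4 "image" with values 0..15.
image = [[float(4 * r + c) for c in range(4)] for r in range(4)]

# Hypothetical projection: the first text dimension selects a centre-pixel
# kernel, the second selects a horizontal-difference kernel.
proj = [[0.0, 0.0] for _ in range(9)]
proj[4] = [1.0, 0.0]   # kernel centre
proj[3] = [0.0, -1.0]  # left of centre
proj[5] = [0.0, 1.0]   # right of centre

text_a = [1.0, 0.0]  # stand-in for one expression
text_b = [0.0, 1.0]  # stand-in for a different expression

out_a = language_conditioned_conv(image, text_a, proj)
out_b = language_conditioned_conv(image, text_b, proj)
print(out_a)  # [[5.0, 6.0], [9.0, 10.0]]
print(out_b)  # [[2.0, 2.0], [2.0, 2.0]]
```

The same image yields different feature maps for the two text vectors, which is the sense in which the visual features are "differentiated" by the language input.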
Taxonomy
Methods: Convolution
