Towards Language-guided Visual Recognition via Dynamic Convolutions

Gen Luo; Yiyi Zhou; Xiaoshuai Sun; Yongjian Wu; Yue Gao; Rongrong Ji

arXiv:2110.08797 · cs.CV · September 15, 2023

TL;DR

This paper introduces LaConv, a dynamic, language-dependent convolution module, and LaConvNet, a fully language-driven network that unifies visual recognition and multi-modal reasoning, demonstrating superior performance on vision-and-language tasks.

Contribution

The paper proposes LaConv, a novel dynamic convolution module guided by natural language, and LaConvNet, the first fully language-driven convolutional network for multi-modal visual recognition.

Findings

01

LaConv outperforms existing multi-modal modules in experiments.

02

LaConvNet achieves high generalization and performance gains.

03

LaConvNet shows +4.7% improvement on RefCOCO+.

Abstract

In this paper, we are committed to establishing a unified and end-to-end multi-modal network by exploring language-guided visual recognition. To approach this target, we first propose a novel multi-modal convolution module called Language-dependent Convolution (LaConv). Its convolution kernels are dynamically generated based on natural language information, which can help extract differentiated visual features for different multi-modal examples. Based on the LaConv module, we further build the first fully language-driven convolution network, termed LaConvNet, which can unify visual recognition and multi-modal reasoning in one forward structure. To validate LaConv and LaConvNet, we conduct extensive experiments on four benchmark datasets of two vision-and-language tasks, i.e., visual question answering (VQA) and referring expression comprehension (REC). The experimental…
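
The core idea in the abstract, convolution kernels generated from language rather than stored as fixed weights, can be illustrated with a toy example. The sketch below is not the LaConv implementation: this page does not describe how LaConv builds its kernels, so the linear kernel generator, the single-channel image, and all names (`generate_kernel`, `conv2d_valid`, `language_conditioned_conv`, `proj`) are assumptions made purely for illustration. It only shows that the same image yields different features when the text feature changes.

```python
import random


def generate_kernel(text_feat, proj, k=3):
    """Map a text feature vector to a k x k kernel.

    Assumption: a plain linear projection (proj has k*k rows, one per
    kernel weight). The real LaConv generator may differ.
    """
    flat = [sum(w * t for w, t in zip(row, text_feat)) for row in proj]
    return [flat[r * k:(r + 1) * k] for r in range(k)]


def conv2d_valid(image, kernel):
    """Single-channel 2D cross-correlation with no padding."""
    k = len(kernel)
    height, width = len(image), len(image[0])
    return [
        [
            sum(kernel[a][b] * image[i + a][j + b]
                for a in range(k) for b in range(k))
            for j in range(width - k + 1)
        ]
        for i in range(height - k + 1)
    ]


def language_conditioned_conv(image, text_feat, proj, k=3):
    """Convolve the image with a kernel generated from the text feature."""
    return conv2d_valid(image, generate_kernel(text_feat, proj, k))


# Hypothetical toy inputs: a 5x5 image and two 4-dimensional text features.
rng = random.Random(0)
K, D = 3, 4
proj = [[rng.uniform(-1.0, 1.0) for _ in range(D)] for _ in range(K * K)]
image = [[float(r * 5 + c) for c in range(5)] for r in range(5)]
text_a = [1.0, 0.0, 0.0, 0.0]
text_b = [0.0, 1.0, 0.0, 0.0]

out_a = language_conditioned_conv(image, text_a, proj, K)
out_b = language_conditioned_conv(image, text_b, proj, K)
```

Here `out_a` and `out_b` are both 3x3 feature maps computed from the same image, but with different kernels because the text features differ. In a trained network the projection would be learned and the text feature would come from a language encoder; both are stand-ins here.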

Code & Models

Repositories

luogen1996/laconvnet (PyTorch, official)

Taxonomy

Methods: Convolution