Modulating early visual processing by language
Harm de Vries, Florian Strub, Jérémie Mary, Hugo Larochelle, Olivier Pietquin, Aaron Courville

TL;DR
This paper introduces MOdulated RESnet (MRN), a novel method that conditions the entire visual processing pipeline on linguistic input, leading to improved performance in visual question answering tasks.
Contribution
The paper proposes modulating visual processing from its early stages with language, by conditioning the batch normalization parameters of a pretrained ResNet on a language embedding.
Findings
Modulating early visual stages improves VQA performance.
Language-conditioned batch normalization enhances visual processing.
The method outperforms strong baselines on two VQA datasets.
Abstract
It is commonly assumed that language refers to high-level visual concepts while leaving low-level visual processing unaffected. This view dominates the current literature in computational models for language-vision tasks, where visual and linguistic input are mostly processed independently before being fused into a single representation. In this paper, we deviate from this classic pipeline and propose to modulate the entire visual processing by linguistic input. Specifically, we condition the batch normalization parameters of a pretrained residual network (ResNet) on a language embedding. This approach, which we call MOdulated RESnet (MRN), significantly improves strong baselines on two visual question answering tasks. Our ablation study shows that modulating from the early stages of the visual processing is beneficial.
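The conditioning step described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes that the language embedding predicts per-channel offsets that are added to the pretrained batch normalization scale and shift, and it uses a single linear map where a small network could be used instead. The names `conditional_batch_norm`, `w_gamma`, and `w_beta` are hypothetical.

```python
import numpy as np


def conditional_batch_norm(x, lang_emb, gamma, beta, w_gamma, w_beta, eps=1e-5):
    """Batch norm whose scale and shift depend on a language embedding.

    x:        feature maps, shape (N, C, H, W)
    lang_emb: language embedding per example, shape (N, D)
    gamma:    pretrained per-channel scale, shape (C,)
    beta:     pretrained per-channel shift, shape (C,)
    w_gamma:  hypothetical linear map from embedding to scale offsets, shape (D, C)
    w_beta:   hypothetical linear map from embedding to shift offsets, shape (D, C)
    """
    # Standard batch normalization: per-channel statistics over N, H, W.
    mean = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + eps)

    # Assumption: the embedding predicts offsets to the pretrained parameters,
    # so zero weights recover the unmodified pretrained network.
    delta_gamma = lang_emb @ w_gamma  # (N, C)
    delta_beta = lang_emb @ w_beta    # (N, C)

    # Each example gets its own scale and shift, broadcast over H and W.
    scale = (gamma[None, :] + delta_gamma)[:, :, None, None]
    shift = (beta[None, :] + delta_beta)[:, :, None, None]
    return scale * x_hat + shift
```

With `w_gamma` and `w_beta` set to zero the function reduces to ordinary batch normalization, which is one plausible way to start from the pretrained ResNet's behavior before the language pathway is trained.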
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
Methods: Dense Connections · Sigmoid Activation · Tanh Activation · Feedforward Network · Long Short-Term Memory · Adam · Modulated Residual Network · Conditional Batch Normalization · Average Pooling
