Multi-modality Latent Interaction Network for Visual Question Answering
Peng Gao, Haoxuan You, Zhanpeng Zhang, Xiaogang Wang, Hongsheng Li

TL;DR
This paper introduces the Multi-modality Latent Interaction module (MLI) for VQA, which models cross-modality relationships between summarized visual and language features and achieves competitive results on the VQA v2.0 and TDIUC datasets.
Contribution
The paper proposes a novel MLI module that captures latent cross-modality relations through summarizations, enhancing VQA performance beyond existing relation modeling methods.
Findings
Achieves competitive results on VQA v2.0 and TDIUC datasets.
Stacking MLI modules models complex relations effectively.
Combining with BERT further improves accuracy.
Abstract
Exploiting relationships between visual regions and question words has achieved great success in learning multi-modality features for Visual Question Answering (VQA). However, we argue that existing methods mostly model relations between individual visual regions and words, which is not enough to correctly answer the question. From a human's perspective, answering a visual question requires understanding summarizations of the visual and language information. In this paper, we propose the Multi-modality Latent Interaction module (MLI) to tackle this problem. The proposed module learns the cross-modality relationships between latent visual and language summarizations, which summarize the visual regions and the question into a small number of latent representations to avoid modeling uninformative individual region-word relations. The cross-modality information between the latent summarizations…
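The summarize-then-interact idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract is truncated and does not specify the operators, so the sketch assumes that each latent summary is a softmax-weighted average of the input features and that each visual-language pair interacts through an element-wise product. The names `summarize`, `latent_interactions`, and the scoring weights are hypothetical, and random values stand in for learned parameters and real features.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def summarize(features, scoring_weights):
    """Pool n feature vectors into k latent summaries (assumed pooling scheme).

    features: n vectors of dimension d (visual regions or question words).
    scoring_weights: k vectors of dimension d; hypothetical learned parameters.
    Each summary is a softmax-weighted average over all n features.
    """
    dim = len(features[0])
    summaries = []
    for w in scoring_weights:
        scores = [sum(wi * fi for wi, fi in zip(w, f)) for f in features]
        attn = softmax(scores)
        summaries.append(
            [sum(a * f[j] for a, f in zip(attn, features)) for j in range(dim)]
        )
    return summaries


def latent_interactions(visual_summaries, language_summaries):
    """One vector per (visual, language) summary pair.

    The element-wise product is an assumption made for illustration only.
    """
    return [
        [v * q for v, q in zip(vs, qs)]
        for vs in visual_summaries
        for qs in language_summaries
    ]


def random_vectors(count, dim):
    """Stand-in for real features or learned weights."""
    return [[random.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(count)]


random.seed(0)
dim, k = 8, 3
regions = random_vectors(36, dim)  # 36 region features (illustrative count)
words = random_vectors(14, dim)    # 14 word features (illustrative count)

visual_summaries = summarize(regions, random_vectors(k, dim))
language_summaries = summarize(words, random_vectors(k, dim))
pairs = latent_interactions(visual_summaries, language_summaries)

print(len(pairs), len(pairs[0]))  # 9 8
```

With k = 3 summaries per modality, the module reasons over 3 × 3 = 9 latent pairs instead of 36 × 14 = 504 individual region-word pairs, which is the reduction the abstract argues for.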
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
Methods: Linear Layer · Weight Decay · Residual Connection · Adam · Layer Normalization · Softmax · Attention Is All You Need · Dropout · Multi-Head Attention
