Efficient Attention Mechanism for Visual Dialog that can Handle All the Interactions between Multiple Inputs

Van-Quang Nguyen, Masanori Suganuma, and Takayuki Okatani

arXiv:1911.11390·cs.CV·July 20, 2020

TL;DR

This paper introduces a lightweight Transformer architecture designed to efficiently handle multiple inputs in visual dialog tasks, achieving improved performance with fewer parameters compared to traditional models.

Contribution

The proposed Light-weight Transformer for Many Inputs (LTMI) effectively manages interactions among multiple inputs with significantly fewer parameters, enhancing visual dialog modeling.

Findings

1. Improved NDCG scores on the VisDial v1.0 dataset.
2. Achieved higher scores with fewer parameters.
3. Validated effectiveness through extensive experiments.

Abstract

It has been a primary concern in recent studies of vision and language tasks to design an effective attention mechanism dealing with interactions between the two modalities. The Transformer has recently been extended and applied to several bi-modal tasks, yielding promising results. For visual dialog, it becomes necessary to consider interactions between three or more inputs, i.e., an image, a question, and a dialog history, or even its individual dialog components. In this paper, we present a neural architecture named Light-weight Transformer for Many Inputs (LTMI) that can efficiently deal with all the interactions between multiple such inputs in visual dialog. It has a block structure similar to the Transformer and employs the same design of attention computation, whereas it has only a small number of parameters, yet has sufficient representational power for the purpose. Assuming a…
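The abstract states that LTMI employs the same attention computation as the Transformer. As a rough illustration of that computation only, the sketch below lets each feature of one input (a toy "question") attend over another input (a toy "image") using scaled dot-product attention. It is not the authors' implementation: the function names `softmax` and `attend`, the toy feature values, and the omission of learned projections and multiple heads are all assumptions made for brevity.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attend(target, source):
    """Scaled dot-product attention: each target vector attends over all source vectors.

    Hypothetical helper for illustration; no learned query/key/value projections.
    """
    d = len(source[0])
    out = []
    for q in target:
        # Similarity of this target vector to every source vector, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in source]
        weights = softmax(scores)
        # Weighted average of the source vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, source)) for j in range(d)])
    return out

# Toy inputs (made-up values): 2 question-token features and 3 image-region features.
question = [[1.0, 0.0], [0.0, 1.0]]
image = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = attend(question, image)
```

The output has one vector per question feature, each a convex combination of the image features. In visual dialog the same operation would be needed for every ordered pair of inputs (image, question, dialog history), which is the multi-input setting the abstract describes.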

Code & Models

Repositories

davidnvq/visdial (PyTorch, official)

Taxonomy

Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Byte Pair Encoding · Dense Connections · Label Smoothing · Adam · Softmax