VQA with Cascade of Self- and Co-Attention Blocks

Aakansha Mishra; Ashish Anand; Prithwijit Guha

arXiv:2302.14777·cs.CV·March 1, 2023

TL;DR

This paper introduces a novel VQA model that employs a cascade of self- and co-attention blocks to enhance multi-modal representation through dense visual-textual interactions, improving performance on standard datasets.

Contribution

It proposes a cascade architecture combining self- and co-attention modules for better multi-modal feature learning in VQA tasks.

Findings

1. Improved accuracy on the VQA2.0 and TDIUC datasets.

2. Key components validated through ablation studies.

3. Cascading attention modules enhances multi-modal interaction.

Abstract

The use of complex attention modules has improved the performance of the Visual Question Answering (VQA) task. This work aims to learn an improved multi-modal representation through dense interaction of visual and textual modalities. The proposed model has an attention block containing both self-attention and co-attention on image and text. The self-attention modules provide the contextual information of objects (for an image) and words (for a question) that are crucial for inferring an answer. On the other hand, co-attention aids the interaction of image and text. Further, fine-grained information is obtained from two modalities by using a Cascade of Self- and Co-Attention blocks (CSCA). This proposal is benchmarked on the widely used VQA2.0 and TDIUC datasets. The efficacy of key components of the model and cascading of attention modules are demonstrated by experiments involving…

Tables

Table 1: Category-wise comparison of CSCA with previous state-of-the-art methods on the TDIUC dataset

Question Type	SAN [36]	RAU [14]	MCB [11]	QTA [28]	BAN [16]	CSCA
Scene Recognition	92.3	93.96	93.06	93.80	93.1	94.48
Sport Recognition	95.5	93.47	92.77	95.55	95.7	95.85
Color Attributes	60.9	66.86	68.54	60.16	67.5	75.51
Other Attributes	46.2	56.49	56.72	54.36	53.2	60.89
Activity Recognition	51.40	51.60	52.35	60.10	54.0	61.00
Positional Reasoning	27.9	35.26	35.40	34.71	27.9	42.14
Object Recognition	87.50	86.11	85.54	86.98	87.5	89.11
Absurd	93.4	96.08	84.82	100.0	94.47	97.28
Utility & Affordance	26.3	31.58	35.09	31.48	24.0	40.35
Object Presence	92.4	94.38	93.64	94.55	95.1	96.34
Counting	52.1	48.43	51.01	53.25	53.9	60.70
Sentiment Und.	53.6	60.09	66.25	64.38	58.7	67.19
Overall Accuracy	82.0	84.26	81.86	85.03	85.5	88.12
Harmonic Mean	53.7	59.00	60.47	60.08	54.9	67.05
Arithmetic Mean	65.0	67.81	67.90	69.11	67.4	73.34

Table 2: Comparing the overall accuracy of CSCA on the TDIUC dataset

Model	Overall Accuracy	Arithmetic Mean
BTUP[1]	82.91	68.82
QCG[24]	82.05	65.67
BAN2-CTI[7]	87.00	72.5
DFAF[9]	85.55	NA
RAMEN[30]	86.86	72.52
MLIN[10]	87.60	NA
CSCA	88.12	73.34

Table 3: Performance of CSCA on TDIUC data (except ‘Absurd’ category samples) when trained without ‘Absurd’ category samples

Metrics	MCB [11]	QTA [28]	BAN [16]	BAN2-CTI [7]	CSCA
Overall Accuracy	78.06	80.95	81.9	85.0	85.30
Arithmetic-MPT	66.07	66.88	64.6	70.6	71.21
Harmonic-MPT	55.43	58.82	52.8	63.8	65.40

Table 4: Model performance on the VQA 2.0 dataset: Validation, Test-Dev and Test-Std splits. CSCA is compared with several state-of-the-art methods, including fusion based, visual attention based and dense attention based methods.

Methods	Val Overall	Test-Dev Yes / No	Test-Dev Number	Test-Dev Other	Test-Dev Overall	Test-Std Overall
MCB [8]	59.14	78.46	38.28	57.80	62.27	53.36
MLB [17]	62.98	83.58	44.92	56.34	66.27	66.62
MUTAN [4]	62.71	82.88	44.54	56.50	66.01	66.38
MFH [40]	62.98	84.27	49.56	59.89	68.76	–
BLOCK [5]	64.91	83.14	51.62	58.97	68.09	68.41
SAN [36]	61.70	78.40	40.71	54.36	61.70	–
BTUP [1]	63.20	81.82	44.21	56.05	65.32	65.67
BAN [16]	65.81	82.16	45.45	55.70	64.30	–
v-VRANet [37]	–	83.31	45.51	58.41	67.20	67.34
ALMA [19]	–	84.62	47.08	58.24	68.12	66.62
ODA [41]	64.23	83.73	47.02	56.57	66.67	66.87
BAN2-CTI [7]	66.00	–	–	–	–	67.4
CRANet [25]	–	83.31	45.51	58.41	67.20	67.34
CoR [34]	65.14	84.98	47.19	58.64	68.19	68.59
MUREL [6]	65.14	84.77	49.84	57.85	68.03	68.41
DFAF [9]	66.66	86.09	53.32	60.49	70.22	70.34
MLIN [10]	66.53	85.96	52.93	60.40	70.18	70.28
LXMERT [32]	–	–	–	–	–	72.5
ViLBERT [20]	–	–	–	–	70.55	70.92
CSCA	67.36	86.57	53.58	61.06	70.72	71.04

Table 5: Evaluating model performance on the VQA2.0 dataset to investigate the effect of different basic attention modules of the proposed model

SA	CA	Yes / No	Number	Other	Overall Accuracy	Parameters (in Millions)
✗	✗	69.95	36.42	50.19	55.80	22
✓	✗	79.08	40.75	49.96	59.69	15
✗	✓	81.17	44.63	56.34	64.13	25
✓	✓	84.92	49.51	58.71	67.36	42

Table 6: Evaluating model performance on the TDIUC dataset to investigate the effect of the number of attention blocks and of self-attention and cross-attention.

SA	CA	Overall Accuracy	Parameters (in Millions)
✗	✗	69.18	7
✗	✓	70.46	21
✓	✗	87.42	25
✓	✓	88.12	36

Equations

$\mathbf{rI}=[\mathbf{r}_{1},\dots,\mathbf{r}_{n_{v}}];\quad\mathbf{r}\in\mathbb{R}^{d_{v}}$

$\mathbf{E_{q}}=[\mathbf{eq}_{1},\dots,\mathbf{eq}_{n_{w}}];\quad\mathbf{eq}\in\mathbb{R}^{d_{w}}$

$\mathbf{rI}(0)$

$\mathbf{E_{q}}(0)$

$Q_{S}^{(i)}$

$K_{S}^{(i)}$

$V_{S}^{(i)}$

$H_{i}=V_{S}^{(i)}\,\mathrm{SoftMax}\left(\frac{{Q_{S}^{(i)}}^{\top}K_{S}^{(i)}}{\sqrt{d_{K}}}\right)$

$\mathrm{MH}(\mathbf{E_{M}})=W_{mh}H$

$Q_{C}^{(i)}$

$K_{C}^{(i)}$

$V_{C}^{(i)}$

$\mathbf{I}_{f}=\frac{1}{k}\sum_{j=1}^{k}\mathbf{rI}(T)[:,j]$

$\mathbf{Q}_{f}=\frac{1}{n_{w}}\sum_{j=1}^{n_{w}}\mathbf{E_{q}}(T)[:,j]$

$\mathbf{F}=\mathbf{I}_{f}\odot\mathbf{Q}_{f}$

$\mathbf{\hat{a}}=\mathrm{FCNet}(\mathbf{F};d_{hp};n_{c})$

$L_{c}=-\sum_{j=1}^{n_{c}}a[j]\,\log(\hat{a}[j])$

$\mathrm{Accuracy}(\hat{a})=\min\left\{\frac{\#\text{humans that said }\hat{a}}{3},\,1\right\}$

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning

Full text

Abstract

The use of complex attention modules has improved the performance of the Visual Question Answering (VQA) task. This work aims to learn an improved multi-modal representation through dense interaction of visual and textual modalities. The proposed model has an attention block containing both self-attention and co-attention on image and text. The self-attention modules provide the contextual information of objects (for an image) and words (for a question) that are crucial for inferring an answer. On the other hand, co-attention aids the interaction of image and text. Further, fine-grained information is obtained from two modalities by using a Cascade of Self- and Co-Attention blocks (CSCA). This proposal is benchmarked on the widely used VQA2.0 and TDIUC datasets. The efficacy of key components of the model and cascading of attention modules are demonstrated by experiments involving ablation analysis.

Keywords: Visual Question Answering, Attention Networks, Self-Attention, Co-attention, Multi-modal Fusion, Classification Networks

1 Introduction

Initial attention-based approaches [35][15][36][1] focused on identifying salient image regions based on the text of a given question. In other words, the focus was on giving attention to images (visual attention) only. Subsequent methods, often referred to as co-attention-based methods [21][4], combined textual attention along with image attention. Textual attention focuses on relevant words in the context of the given image. Co-attention-based methods improved the performance of VQA systems. A few studies [36][22][9][10] have shown that considering attention in a cascaded or stack-based manner helps in obtaining enriched representation with fine-grained information.

Recent attention-based models have taken inspiration from transformer-based models [33] to include self-attention (SA) as well. SA helps incorporate the internal correlation within a modality. For the text modality, SA encodes the internal correlation among words to obtain an informative representation of the given sentence. Similarly, for the image modality, SA helps encode the correlation among the salient regions of the image. Figure 1 shows an example for illustration. The given question is “What color is the women’s shirt?”. Salient regions within the image include ‘woman’. It is likely to be informative if the region containing the ‘woman’ keeps contextual information such as “dress she is wearing” and “hair color”, as well as its correlation with other salient objects. Here, the woman’s shirt could be one of the more correlated regions with respect to some other salient objects. SA helps in encoding such information.

Based on the advantages of SA, co-attention (CA) and cascaded attention mechanisms, this work proposes combining them in a systematic manner. Towards this objective, the proposed model builds a self- and co-attention based attention block (SCA) that combines SA and CA in a specific way. For each of the text and image modalities, a dedicated SA module obtains a feature representation for the respective modality. The co-attention module then uses the self-attended representation of one modality and attends on the self-attended representation of the other modality to obtain a cross-modality contextual representation for the second modality. Thus, there are two SA modules (one each for the text and image modalities) and two co-attention modules within a single SCA block (Figure 2). Within one SCA block, each modality guides itself to capture its internal correlation, and the two modalities guide each other to learn a robust representation of both the visual and textual domains.

The proposed model exploits the niche attributes of the different attention mechanisms and further combines them together in a dense attention module (SCA block). A Cascade of multiple SCA blocks (CSCA) is used to extract fine-grained information. Figure 2 gives the overview of the $t^{\text{th}}$ SCA block which takes representations of question and image of the $(t-1)^{\text{th}}$ block as input, and provides the improved representation of question and image.

To analyse and evaluate the model performance, extensive experiments are performed on two widely used VQA datasets: VQA2.0 [12] and TDIUC [14]. Ablation analysis experiments are also performed to understand the impact of the important components of the proposed model. Primary contributions of this work are:

• A dense attention based VQA model comprising cascaded attention blocks.

• The core of each attention block consists of self-attention and co-attention, so that the two modalities guide each other to obtain an enriched representation.

• Extensive performance evaluation along with ablation analysis of the proposed model on the two benchmark datasets: TDIUC and VQA2.0.

2 Related Work

VQA, being a multimodal task, requires a unified representation of the text and image modalities. Initial VQA models [2, 12, 31, 13] adopted simple fusion based approaches. These models first obtained feature representations of each modality using the corresponding pre-trained networks and then combined them into a joint representation using a fusion scheme. Simple fusion schemes include concatenation, element-wise summation and element-wise multiplication. Fukui et al. [8] proposed bilinear pooling to better capture the interaction between components of the two modalities. Given the advantages of bilinear pooling based fusion, further variants with lower complexity or faster convergence were proposed. MFB [39], MLB [17] and MFH [40] were proposed to obtain representations providing better interaction of the two modalities.

The introduction of the attention mechanism in [3] equipped neural models with a systematic procedure to assign relative weights of importance to sequential inputs. Shi et al. [29] introduced question-guided image attention to focus on salient image regions relevant to the given question, which helped in obtaining improved feature representations. This led to the development of several attention based approaches for VQA [15][16][36][34][22][35][1]. Studies in [36][34][22] have shown that applying attention multiple times helps in obtaining an enriched representation embedded with fine-grained information.

Authors in [21][39] have proposed that attention on textual features in the context of visual features, along with visual attention, plays a key role in VQA models. Such a two-way attention mechanism is referred to as dual attention, co-attention or cross-modality attention in the literature. We also use these terms interchangeably. Kim et al. [16] proposed bilinear interaction based attention for dual modality. Do et al. [7] proposed an approach exploiting knowledge distillation with a teacher and a student model. Mishra et al. [22] proposed a co-attention based multistage model for VQA. In another work, the authors [23] proposed question categorization and dual attention for VQA. RAMEN [30] is a unified model that uses high level reasoning and can deal with VQA datasets based on both real-world and synthetic images.

Another class of attention mechanisms uses intra-modal attention (self-attention) along with cross-modal attention (co-attention) to learn better feature representations. Gao et al. [9] proposed DFAF, which combines self-attention and co-attention. Multi-modal Latent Interaction (MLIN) [10] used multi-modal reasoning through summarization, interaction and aggregation. Yu et al. [38] proposed an encoder-decoder based dense attention mechanism. These models are denser than the previous approaches and hence are referred to as dense attention based models. Authors in [20][32] have proposed transformer based attention models for multimodal tasks. These models are pretrained for multiple tasks on huge datasets and can be further exploited for downstream tasks.

The proposed model falls in the category of dense attention based methods. It uses a cascade of attention blocks to obtain a multi-modal feature representation. Here, each attention block comprises intra-modality and cross-modality interactions. The proposed method is described next.

3 Proposed Method

The proposed framework treats VQA as an answer classification task following existing works like [1][9][2][12][10]. The input image $I$ ( $I\in\mathbf{\mathcal{I}}$ ) and the associated natural language question $q$ ( $q\in\mathbf{\mathcal{Q}}$ ) are first subjected to feature extraction (Subsection 3.1). Pretrained deep networks are used to extract features from a few salient image regions. The network embeddings are used to represent the input image. Similarly, a pretrained network is used to obtain the word embeddings of the associated input question. These word embeddings collectively represent the input question. The feature embeddings of both image and text modalities are subjected to self-attention mechanism (Subsection 3.2) for capturing the relationships among different regions of $I$ and words of $q$ . The self-attended representations of these two modalities are further processed by co-attention modules (Subsection 3.3). This single stage of Self and Co-Attention mechanism cascade forms a single SCA block (Figure 2). Multiple SCA blocks are cascaded to obtain further fine grained representations of both modalities. The embeddings obtained from the final SCA block are fused (Subsection 3.4) and fed to the answer classification network (Subsection 3.5) to predict the answer $\hat{a}$ ( $\hat{a}\in\mathbf{\mathcal{A}}$ ).

3.1 Feature Extraction

A pretrained deep network based object detection model (Faster R-CNN, [27]) is used to identify the top- $n_{v}$ salient regions from the input image $I$ . The pretrained ResNet-101 [13] network is used to compute the visual feature of each region as an embedding $\mathbf{r}\in\mathbb{R}^{d_{v}}$ . Thus, the input image $I$ is represented as $\mathbf{rI}\in\mathbb{R}^{d_{v}\times n_{v}}$ by using $n_{v}$ number of $d_{v}$ dimensional ResNet-101 embeddings.

$\mathbf{rI}=[\mathbf{r}_{1},\dots,\mathbf{r}_{n_{v}}];\quad\mathbf{r}\in\mathbb{R}^{d_{v}}$ (1)

The input natural language question $q$ is first padded or trimmed to a length of $n_{w}$ words. The word features are then extracted as pretrained GloVe embeddings $\mathbf{eq}\in\mathbb{R}^{d_{w}}$ [26]. Thus, the question $q$ is represented as $\mathbf{E_{q}}\in\mathbb{R}^{d_{w}\times n_{w}}$ by using $n_{w}$ number of $d_{w}$ dimensional embeddings.

$\mathbf{E_{q}}=[\mathbf{eq}_{1},\dots,\mathbf{eq}_{n_{w}}];\quad\mathbf{eq}\in\mathbb{R}^{d_{w}}$ (2)
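The padding and trimming step can be sketched as follows. This is a minimal illustration, assuming whitespace tokenization, right-padding and a hypothetical `<pad>` token, none of which are specified in the text; the default length of 14 is the value of $n_{w}$ reported in Section 4.3.

```python
def pad_or_trim(tokens, n_w=14, pad_token="<pad>"):
    """Return exactly n_w tokens: trim long questions, right-pad short ones.

    pad_token and right-padding are assumptions made for illustration.
    """
    trimmed = tokens[:n_w]
    return trimmed + [pad_token] * (n_w - len(trimmed))


# Whitespace tokenization is used here only to keep the sketch short.
question = "What color is the shirt ?".split()
fixed = pad_or_trim(question)
print(len(fixed), fixed[:7])
```

Each of the resulting $n_{w}$ tokens would then be mapped to its $d_{w}$ dimensional GloVe vector to form $\mathbf{E_{q}}$.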

All feature embeddings in $\mathbf{rI}$ and $\mathbf{E_{q}}$ are projected to a common $d$ dimensional space to obtain the respective initial feature embedding matrices as $\mathbf{rI}(0)$ and $\mathbf{E_{q}}(0)$ .

$\mathbf{rI}(0)=W_{c}^{I}\,\mathbf{rI}$ (3)

$\mathbf{E_{q}}(0)=W_{c}^{Q}\,\mathbf{E_{q}}$ (4)

Here, $W_{c}^{I}\in\mathbb{R}^{d\times d_{v}}$ and $W_{c}^{Q}\in\mathbb{R}^{d\times d_{w}}$ are the transformation matrices. These representations are provided as input to the self- and co-attention modules.
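The shapes involved in this projection can be checked with a short NumPy sketch. It assumes a plain linear map without bias, which is what the stated matrix dimensions suggest; the random values merely stand in for the ResNet-101 and GloVe features and for the learned matrices, and the sizes are those reported for TDIUC in Section 4.3.

```python
import numpy as np

rng = np.random.default_rng(0)

d_v, n_v = 2048, 36   # region feature size and number of regions (TDIUC setting)
d_w, n_w = 300, 14    # word embedding size and question length
d = 512               # common embedding dimension

# Random stand-ins for the extracted features (columns are regions / words).
rI = rng.standard_normal((d_v, n_v))
E_q = rng.standard_normal((d_w, n_w))

# Random stand-ins for the learned transformation matrices.
W_c_I = rng.standard_normal((d, d_v)) * 0.01
W_c_Q = rng.standard_normal((d, d_w)) * 0.01

# Assumed bias-free linear projection to the common d-dimensional space.
rI_0 = W_c_I @ rI      # shape (d, n_v)
E_q_0 = W_c_Q @ E_q    # shape (d, n_w)

print(rI_0.shape, E_q_0.shape)
```

After this step both modalities share the embedding dimension $d$, so the same attention machinery can operate on either of them.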

3.2 Self-Attention

The self-attention (SA) mechanism is one of the key components of the proposed model. It is incorporated for both the textual (question as a collection of words) and visual (image as top- $n_{v}$ salient regions) modalities. At the $t^{\text{th}}$ ( $t=1,\ldots T$ ) block, the inputs to SA are $\mathbf{rI}(t-1)$ and $\mathbf{E_{q}}(t-1)$ . Following [33], SA uses keys and queries of dimension $d_{KQ}$ and values of dimension $d_{VS}$ . Multi-Head Attention [33] is incorporated to capture attention from different aspects. For this, $n_{h}$ parallel heads are added, where each head is expected to learn the relationships from a different view (for the image) or context (for the question).

Let $\mathbf{E_{M}}=\{\mathbf{em}_{1}\ldots\mathbf{em}_{l}\}$ be a matrix of feature embeddings, where $\mathbf{em}\in\mathbb{R}^{d_{m}}$ and $\mathbf{E_{M}}\in\mathbb{R}^{d_{m}\times l}$ . For visual features, $\mathbf{E_{M}}=\mathbf{rI}(t-1)$ , $l=n_{v}$ and $d_{m}=d$ . Similarly, for question features, $\mathbf{E_{M}}=\mathbf{E_{q}}(t-1)$ , $l=n_{w}$ and $d_{m}=d$ .

The query ( $Q_{S}^{(i)}$ ), key ( $K_{S}^{(i)}$ ) and value ( $V_{S}^{(i)}$ ) matrices for the $i^{\text{th}}$ head can be respectively expressed as follows

[Equations 5–7: definitions of $Q_{S}^{(i)}$ , $K_{S}^{(i)}$ and $V_{S}^{(i)}$ ]

where $W^{QS}_{i}\in\mathbb{R}^{d_{m}\times d_{KQ}}$ , $W^{KS}_{i}\in\mathbb{R}^{d_{m}\times d_{KQ}}$ and $W^{VS}_{i}\in\mathbb{R}^{d_{m}\times d_{VS}}$ are transformation matrices. Using $\{Q_{S}^{(i)},K_{S}^{(i)},V_{S}^{(i)}\}$ , the inner product of each query with all the keys is computed and divided by $\sqrt{d_{K}}$ for more stable gradients [33]. The SoftMax function is applied on the scaled inner products to obtain the attention weights for question words and salient image regions. A scaled inner product based attention is computed for all the heads in the following manner.

$H_{i}=V_{S}^{(i)}\,\mathrm{SoftMax}\left(\frac{{Q_{S}^{(i)}}^{\top}K_{S}^{(i)}}{\sqrt{d_{K}}}\right)$ (8)

$\mathrm{MH}(\mathbf{E_{M}})=W_{mh}H$ (9)

Here, $W_{mh}\in\mathbb{R}^{d_{m}\times(n_{h}\times d_{VS})}$ is the transformation matrix. The output ( $\mathrm{MH}(\mathbf{E_{M}})$ ) of the multi-head attention module is passed through fully connected feed forward layers with ReLU activation and dropout to prevent overfitting. Further, residual connections [13] followed by layer normalization are applied on top of the fully connected layers for faster and more accurate training. The layer normalization is applied over the embedding dimension only. Finally, the self-attended embeddings of the input feature $\mathbf{E_{M}}$ are obtained as $\mathbf{SE_{M}}=\{\mathbf{sem}_{1}\ldots\mathbf{sem}_{l}\}$ where $\mathbf{sem}\in\mathbb{R}^{d_{m}}$ and $\mathbf{SE_{M}}\in\mathbb{R}^{d_{m}\times l}$ . The multi-head attention mechanism is shown in Figure 4.
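The multi-head self-attention described in this subsection can be sketched in NumPy as below. This is an illustrative reading rather than the authors' implementation. It assumes that queries, keys and values are obtained by applying the transposed transformation matrices to $\mathbf{E_{M}}$ (the choice that makes the stated dimensions consistent), that the SoftMax normalizes each query's weights over all keys, and that $H$ stacks the $n_{h}$ heads row-wise. The feed forward layers, dropout, residual connections and layer normalization are omitted, and all weights are random stand-ins.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def multi_head_self_attention(E_M, W_Q, W_K, W_V, W_mh):
    """E_M: (d_m, l). W_Q, W_K, W_V: lists with one matrix per head."""
    heads = []
    for W_q, W_k, W_v in zip(W_Q, W_K, W_V):
        Q = W_q.T @ E_M                      # (d_KQ, l), assumed projection
        K = W_k.T @ E_M                      # (d_KQ, l)
        V = W_v.T @ E_M                      # (d_VS, l)
        scores = Q.T @ K / np.sqrt(Q.shape[0])   # (l, l): row j is query j
        weights = softmax(scores, axis=1)        # each row sums to one
        heads.append(V @ weights.T)              # (d_VS, l)
    H = np.concatenate(heads, axis=0)        # (n_h * d_VS, l)
    return W_mh @ H                          # (d_m, l)


rng = np.random.default_rng(0)
d_m, l, d_KQ, d_VS, n_h = 512, 14, 64, 64, 8   # sizes from Section 4.3


def random_heads(rows, cols):
    """Random stand-ins for one learned matrix per head."""
    return [rng.standard_normal((rows, cols)) * 0.02 for _ in range(n_h)]


E_M = rng.standard_normal((d_m, l))            # e.g. projected question words
W_Q, W_K = random_heads(d_m, d_KQ), random_heads(d_m, d_KQ)
W_V = random_heads(d_m, d_VS)
W_mh = rng.standard_normal((d_m, n_h * d_VS)) * 0.02

out = multi_head_self_attention(E_M, W_Q, W_K, W_V, W_mh)
print(out.shape)
```

The output keeps the input shape $d_{m}\times l$, which is what allows blocks to be stacked.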

3.3 Co-Attention

For cross-modal interactions, the co-attention module takes the representations of the two modalities and generates attention for each in the context of the other. To facilitate this, the self-attended embeddings $\widetilde{\mathbf{E_{q}}}(t-1)$ and $\widetilde{\mathbf{rI}}(t-1)$ are taken as input. For generating image attention in the context of question words, keys and values are generated from the self-attended intermediate question representation while the query is obtained from the image itself (following Equation 8). Thus, the query ( $Q_{C}^{(i)}$ ), key ( $K_{C}^{(i)}$ ) and value ( $V_{C}^{(i)}$ ) are respectively computed as follows.

[Equations 10–12: definitions of $Q_{C}^{(i)}$ , $K_{C}^{(i)}$ and $V_{C}^{(i)}$ ]

Here, $W_{i}^{QC}\in\mathbb{R}^{d_{m}\times d_{KQ}}$ , $W_{i}^{KC}\in\mathbb{R}^{d_{m}\times d_{KQ}}$ and $W_{i}^{VC}\in\mathbb{R}^{d_{m}\times d_{KV}}$ are transformation matrices. Similarly, for cross-modal question attention, the query is obtained from the self-attended question embeddings, while the keys and values are obtained from the self-attended image embeddings. These queries, keys and values are similarly processed following Equations 8 and 9 to obtain the multi-head attention. This is fed to fully connected layers with ReLU, dropout, skip connections and layer normalization. The output of this network provides the final output of the co-attention module. Figure 5 gives an overview of the self-attention and co-attention mechanisms.
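The two directions of co-attention differ only in which modality supplies the queries and which supplies the keys and values. The single-head NumPy sketch below illustrates this under the same assumptions as a standard scaled dot-product attention (transposed projection matrices, SoftMax over the keys for each query). The helper names `attend` and `random_projections` are hypothetical, the weights are random stand-ins, and the multi-head combination and feed forward layers are left out.

```python
import numpy as np


def attend(E_query, E_context, W_q, W_k, W_v):
    """Single-head attention of the columns of E_query over E_context."""
    Q = W_q.T @ E_query                       # (d_KQ, l_query)
    K = W_k.T @ E_context                     # (d_KQ, l_context)
    V = W_v.T @ E_context                     # (d_V, l_context)
    scores = Q.T @ K / np.sqrt(Q.shape[0])    # (l_query, l_context)
    scores -= scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)   # rows sum to one
    return V @ weights.T                      # (d_V, l_query)


def random_projections(rng, d_m, d_KQ, d_V):
    """Random stand-ins for the learned query, key and value matrices."""
    return (rng.standard_normal((d_m, d_KQ)) * 0.02,
            rng.standard_normal((d_m, d_KQ)) * 0.02,
            rng.standard_normal((d_m, d_V)) * 0.02)


rng = np.random.default_rng(0)
d_m, d_KQ, d_V = 512, 64, 64
n_v, n_w = 36, 14

img_sa = rng.standard_normal((d_m, n_v))   # self-attended image regions
qst_sa = rng.standard_normal((d_m, n_w))   # self-attended question words

# Image attention: queries from the image, keys and values from the question.
img_ctx = attend(img_sa, qst_sa, *random_projections(rng, d_m, d_KQ, d_V))

# Question attention: queries from the question, keys and values from the image.
qst_ctx = attend(qst_sa, img_sa, *random_projections(rng, d_m, d_KQ, d_V))

print(img_ctx.shape, qst_ctx.shape)
```

In both directions the output has one column per query element, so the image branch keeps $n_{v}$ columns and the question branch keeps $n_{w}$ columns.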

3.4 Cascading & Fusion

A single SCA block comprising self-attention (intra-modality interaction) and co-attention (inter-modality interaction) generates an enriched representation ( $\mathbf{rI}(t),\mathbf{E_{q}}(t)$ ) of its input visual and textual features.

Existing works [36][22] suggest stacking multiple such blocks to obtain further fine grained representations. This is accomplished by cascading $T$ SCA blocks. Let $\mathbf{rI}(T)\in\mathbb{R}^{d\times n_{v}}$ and $\mathbf{E_{q}}(T)\in\mathbb{R}^{d\times n_{w}}$ be the respective visual and question representations obtained from the final ( $T^{\text{th}}$ ) SCA block.

The feature representations are obtained by averaging the attended embeddings of two modalities. So, the final visual embedding, say $\mathbf{I}_{f}$ is obtained as follows.

$\mathbf{I}_{f}=\frac{1}{k}\sum_{j=1}^{k}\mathbf{rI}(T)[:,j]$ (13)

Similarly, the question encoding, say $\mathbf{Q}_{f}$ is evaluated in the following manner.

$\mathbf{Q}_{f}=\frac{1}{n_{w}}\sum_{j=1}^{n_{w}}\mathbf{E_{q}}(T)[:,j]$ (14)

The unified multi-modal representation, say $\mathbf{F}\in\mathbb{R}^{d}$ is obtained by fusing $\mathbf{I}_{f}$ and $\mathbf{Q}_{f}$ through element-wise multiplication.

$\mathbf{F}=\mathbf{I}_{f}\odot\mathbf{Q}_{f}$ (15)

The fused embedding $\mathbf{F}$ is fed to a fully connected network for answer prediction.
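The pooling and fusion steps amount to a column-wise mean followed by an element-wise product. The NumPy sketch below shows this with random stand-ins for the outputs of the final SCA block; it assumes that the averaging for the image runs over all $n_{v}$ region columns.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_v, n_w = 512, 36, 14

# Random stand-ins for the outputs of the final (T-th) SCA block.
rI_T = rng.standard_normal((d, n_v))
E_q_T = rng.standard_normal((d, n_w))

I_f = rI_T.mean(axis=1)    # average over region columns (assumed k = n_v)
Q_f = E_q_T.mean(axis=1)   # average over word columns
F = I_f * Q_f              # element-wise (Hadamard) product, shape (d,)

print(F.shape)
```

Because both pooled vectors live in the same $d$ dimensional space, the product needs no further projection before classification.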

3.5 Answer Prediction

The fused embedding $\mathbf{F}$ is fed to a fully connected network with a single hidden layer of dimension $d_{hp}$ . The number of labels at the output layer is $n_{c}$ ( $n_{c}=\mid\mathbf{\mathcal{A}}\mid$ ). The output answer vector, say $\mathbf{\hat{a}}$ , is predicted as follows.

$\mathbf{\hat{a}}=\mathrm{FCNet}(\mathbf{F};d_{hp};n_{c})$ (16)

3.6 Model Learning

Let the respective ground truth and predicted answer be $a$ and $\hat{a}$ ( $a,\hat{a}\in\mathbf{\mathcal{A}}$ ) for input image $I$ and question $q$ . The model is trained with the cross-entropy loss for answer prediction, defined as

$L_{c}=-\sum_{j=1}^{n_{c}}a[j]\,\log(\hat{a}[j])$ (17)

The combined set of parameters for the proposed model includes the ones for feature extraction, the blocks of dense attention and the fusion mechanism.
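The cross-entropy loss can be illustrated with a small pure-Python example. It assumes a one-hot ground-truth vector and that the classifier output is turned into probabilities with a SoftMax, neither of which is stated explicitly in the text; the small `eps` is added only to avoid taking the logarithm of zero.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities (assumed output activation)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target, probs, eps=1e-12):
    """L_c = -sum_j a[j] * log(a_hat[j]) over the n_c answer classes."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, probs))


logits = [2.0, 0.5, -1.0]   # toy classifier scores for n_c = 3 answers
a = [1.0, 0.0, 0.0]         # one-hot ground truth (assumption)
a_hat = softmax(logits)
loss = cross_entropy(a, a_hat)
print(round(loss, 4))
```

With a one-hot target the sum reduces to the negative log-probability assigned to the correct answer.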

4 Experiment Design

This section discusses the datasets used to benchmark the proposed model, the three evaluation metrics and the necessary implementation details.

4.1 Dataset

The proposed model is evaluated through experiments performed on the datasets VQA2.0 [12] and TDIUC [14]. The VQA2.0 [12] dataset is widely used for the VQA task. There are three question categories in VQA2.0. These are ‘Yes/No’ ( $37.6\%$ ), ‘Number’ ( $13.03\%$ ) and ‘Other’ ( $49.37\%$ ). The dataset is divided into train, validation and test sets with $443757$ , $214354$ and $447793$ image, question and answer triplets respectively.

The Task-Directed Image Understanding Challenge (TDIUC) [14] is another large VQA dataset of real images. Questions are categorized into $12$ types. These are ‘Scene Recognition’ ( $4.03\%$ ), ‘Sport Recognition’ ( $1.91\%$ ), ‘Color’ ( $11.82\%$ ), ‘Other Attributes’ ( $1.73\%$ ), ‘Activity Recognition’ ( $0.52\%$ ), ‘Positional Reasoning’ ( $2.32\%$ ), ‘Object Recognition’ ( $5.66\%$ ), ‘Absurd’ ( $22.16\%$ ), ‘Utility & Affordance’ ( $0.03\%$ ), ‘Object Presence’ ( $39.73\%$ ), ‘Counting’ ( $9.96\%$ ) and ‘Sentiment Understanding’ ( $0.13\%$ ). A total of $1.6$ million question, image and answer triplets are split into train and validation sets. The train set consists of $1.1$ million triplets and the validation split contains $0.5$ million triplets. To deal with language prior issues, TDIUC includes a special category, ‘Absurd’, where the input question is not related to the visual content of the given image.

4.2 Evaluation Metrics

For evaluation on the TDIUC dataset, Arithmetic-Mean Per Type (AMPT) and Harmonic-Mean Per Type (HMPT) are proposed in [14] as fair evaluation metrics along with Overall Accuracy. The AMPT is the average of the question category-wise accuracies with uniform weight given to each category. The HMPT is the harmonic mean of the same category-wise accuracies; since it is dominated by the lowest values, it measures the ability of the model to score highly across all question types.
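The two per-type metrics can be computed as follows. The category names and accuracies in this sketch are toy values chosen for illustration, not results from the paper, and returning zero when any category has zero accuracy is a convention assumed here.

```python
def arithmetic_mpt(per_type_accuracy):
    """Unweighted mean of the per-category accuracies."""
    values = list(per_type_accuracy.values())
    return sum(values) / len(values)


def harmonic_mpt(per_type_accuracy):
    """Harmonic mean of the per-category accuracies.

    Returns 0.0 if any category has zero accuracy (assumed convention).
    """
    values = list(per_type_accuracy.values())
    if any(v == 0 for v in values):
        return 0.0
    return len(values) / sum(1.0 / v for v in values)


# Toy per-category accuracies (illustrative only).
toy = {"Counting": 50.0, "Color": 100.0}
print(arithmetic_mpt(toy), round(harmonic_mpt(toy), 2))
```

For the toy input the arithmetic mean is 75.0 while the harmonic mean is about 66.67, showing how the harmonic mean is pulled towards the weaker category.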

The VQA2.0 dataset evaluation is performed using the following metric defined in [2].

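$\mathrm{Accuracy}(\hat{a})=\min\left\{\frac{\#\text{humans that said }\hat{a}}{3},\,1\right\}$ (18)

Each question in the VQA2.0 dataset was answered by $10$ annotators. Under the above metric, a predicted answer receives full credit if it matches the answers given by at least $3$ annotators, and partial credit if it matches fewer.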
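A direct transcription of this metric for a single question is given below. It is a simplified sketch that compares strings exactly; any answer normalization applied by an evaluation script is not modelled, and the annotator answers are made up for illustration.

```python
def vqa_accuracy(predicted, human_answers):
    """min(#humans that said the predicted answer / 3, 1) for one question."""
    matches = sum(1 for answer in human_answers if answer == predicted)
    return min(matches / 3.0, 1.0)


# Ten hypothetical annotator answers for one question.
annotators = ["red"] * 2 + ["maroon"] * 5 + ["pink"] * 3

print(vqa_accuracy("maroon", annotators))  # five matches, full credit
print(vqa_accuracy("red", annotators))     # two matches, partial credit
print(vqa_accuracy("blue", annotators))    # no match
```

The dataset-level accuracy is then the mean of this per-question score.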

4.3 Implementation Details

The visual feature representation $\mathbf{rI}$ is constructed by extracting $n_{v}=36$ (for TDIUC) and $n_{v}=100$ (for VQA2.0) image regions. The use of ResNet-101 embeddings provides image region features of $d_{v}=2048$ dimensions. The question length in terms of number of tokens ( $n_{w}$ ) is set to $14$ by trimming or padding. GloVe word embeddings of $d_{w}=300$ dimensions are used. The image and word features are projected to the same dimension $d=512$ . For self- and co-attention computations, the key, query and value vector dimensions are set to $64$ , i.e., $d_{KQ}=d_{VS}=64$ . The model uses $n_{h}=8$ heads for multi-head attention. The model is trained for $15$ epochs with a batch size of $64$ samples for both experiments and analysis. The hidden layer dimension of the answer prediction FCNet is set to $d_{hp}=1024$ . The Adamax optimizer [18] is used with a decaying step learning rate. The initial learning rate is set to $0.002$ , and it decays by $0.1$ after every $5$ epochs. The proposed model CSCA is built on the PyTorch framework and is trained on an NVIDIA GTX $1080$ GPU.
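The step learning-rate schedule can be written down in a few lines. The sketch assumes that "decays by 0.1" means multiplying the rate by 0.1 and that epochs are counted from zero; both are interpretations rather than details given in the text.

```python
def learning_rate(epoch, initial_lr=0.002, decay=0.1, step=5):
    """Step schedule: multiply the rate by `decay` every `step` epochs.

    Epochs are 0-indexed; multiplicative decay is an assumption.
    """
    return initial_lr * decay ** (epoch // step)


# Learning rate at the start of each of the 15 training epochs.
schedule = [learning_rate(epoch) for epoch in range(15)]
print(schedule[0], schedule[5], schedule[10])
```

Under these assumptions the 15 epochs are split into three phases of five epochs each, with the rate dropping by a factor of ten between phases.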

5 Results and Discussion

5.1 Quantitative Results

Overall Performance & Category-wise Performance Comparison on TDIUC Dataset – Tables 1 and 2 present the respective class-wise and overall performance for the TDIUC dataset. In terms of the overall accuracy, Arithmetic-MPT (AMPT) and Harmonic-MPT (HMPT) measures, the proposed model CSCA exhibits better performance compared to most of the baseline methods. Also, in terms of class-wise accuracy, CSCA leads in all classes except ‘Absurd’, where QTA reaches 100.0 against 97.28 for CSCA. A significant relative gain of 12.6% is observed over the next best performing model (BAN, 53.9 against 60.70 for CSCA) for the ‘Counting’ category of questions. Table 3 presents the results for different models trained without the ‘Absurd’ category of questions. It is observed that CSCA performs better than the existing ones on all three evaluation metrics.

Overall Performance & Category-wise Performance Comparison on VQA2.0 Dataset – Table 4 presents the results on the validation, test-dev and test-std splits of the VQA2.0 dataset. The performance of the proposed model CSCA is comparable with that of the best among the existing methods. The models LXMERT [32] and ViLBERT [20] are pre-trained for multiple vision and language tasks and are fine-tuned for VQA. Here, CSCA obtains 67.36% accuracy on the validation set. This is an improvement of 0.70 percentage points (around 1% in relative terms) over the best validation accuracy among the existing methods (66.66% for DFAF [9]).
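The relative gain quoted for the ‘Counting’ category can be reproduced from the ‘Counting’ row of Table 1. The helper name `relative_gain` is introduced here only for illustration.

```python
def relative_gain(new, baseline):
    """Relative improvement of `new` over `baseline`, in percent."""
    return 100.0 * (new - baseline) / baseline


# 'Counting' accuracies of the compared methods, taken from Table 1.
counting = {"SAN": 52.1, "RAU": 48.43, "MCB": 51.01, "QTA": 53.25, "BAN": 53.9}
csca = 60.70

best_name = max(counting, key=counting.get)      # strongest previous method
gain = relative_gain(csca, counting[best_name])
print(best_name, round(gain, 1))  # BAN 12.6
```

The absolute difference is 6.8 points, which corresponds to the reported 12.6% when expressed relative to the 53.9 baseline.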

5.2 Basic Analysis

Effect of Training Data Size on Performance – An analysis is performed to observe the effect of training dataset size on model performance. The primary objective of this experiment is to ascertain whether a model trained on a smaller dataset can provide performance similar to that of one learned from the complete set. To explore this, the model is trained with four different datasets obtained from the original VQA2.0 dataset. The first three are obtained by randomly shuffling all samples of the VQA2.0 dataset and then extracting 25%, 50% and 75% of the samples. The fourth is the complete VQA2.0 dataset (i.e. 100%). Other experimental settings, such as the hidden dimension and the number of answer classes, are kept similar to the original setup for all variants of the dataset. The epoch-wise performance for the four datasets is shown in Figure 6(a). As expected, the model performance improves with an increase in training dataset size. In all four settings, the model performance evolves over the epochs in a similar fashion. However, Figure 6(b) indicates that the relative gain achieved by increasing the training dataset size from $25\%$ to $50\%$ is larger than that achieved by increasing it from $50\%$ to $75\%$ or from $75\%$ to $100\%$ . This observation may be attributed to the fact that, with randomly shuffled data, not many novel instances are encountered during the subsequent increases of the training data.

Effect of Number of SCA Blocks – In one pass, it is difficult for a model to grasp all relevant information through a representation. Thus, attention blocks in cascade extract the fine-grained information, with each block passing its output on to the next one for further refinement. A set of experiments is performed to identify the optimal number of blocks in the cascade. Additionally, the effect of different independent attention mechanisms (SA only, CA only, SCA) on answer prediction is also analyzed. Figure 7(a) gives the overall performance on the validation split of the VQA2.0 dataset with respect to the number of blocks. Figure 7(b) shows the parameter counts with respect to the number of blocks. As expected, the models perform poorly with a single attention block (SA only, CA only, SCA). The performance rises only up to four blocks; increasing the number of blocks beyond four does not lead to any further performance improvement, while adding more blocks also leads to an increase in the number of model parameters (Figure 7(b)). Furthermore, the model using only the CA module performs better than the one using only the SA module. This is as per the expectation. Similarly, Figure 8 shows that the model performance keeps improving until the fourth SCA block for the TDIUC dataset. The model performance starts deteriorating with a further increase in the number of blocks.
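The construction of the reduced training sets can be sketched as below. It assumes a single shuffle from which nested prefixes are taken, so that each larger subset contains the smaller ones; the text does not say whether the subsets were nested, and the fixed seed and the integer sample identifiers are illustrative.

```python
import random


def nested_subsets(samples, fractions=(0.25, 0.5, 0.75, 1.0), seed=0):
    """Shuffle once, then take growing prefixes of the shuffled list.

    Nested prefixes are an assumption; the seed is arbitrary.
    """
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    return {f: shuffled[: int(len(shuffled) * f)] for f in fractions}


# Stand-in identifiers for training triplets.
subsets = nested_subsets(range(1000))
print({fraction: len(items) for fraction, items in subsets.items()})
```

The model would then be trained separately on each subset with all other settings unchanged.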

5.3 Ablation Analysis

The proposed model performs self-attention on the two modalities to obtain intra-modality correlated features. Then the co-attention module uses respective representations of the two modalities to obtain cross-modality correlated features by performing attention for one modality in the context of another. In this ablation analysis, we examine the impact of individual attention module in various combinations to understand their importance. We also analyze the set of correct predictions obtained in these settings.

Tables 5 and 6 present the results of the ablation experiments in terms of performance and complexity. The complexity is expressed as the number of model parameters. The first row of each table shows the model performance when neither type of attention is incorporated: the features of both modalities are fused directly via element-wise multiplication, without applying self- or co-attention. The second row shows the performance when only self-attention (SA only) is applied to both modalities and answer prediction is based on the fused embedding of the self-attended representations of the individual modalities. Here, the fused representation is again obtained via element-wise multiplication. The third row shows the results when only co-attention (CA only) is applied, attending to the image in the context of the question and vice versa. The last row shows the results of the proposed model, which comprises both self-attention and co-attention in cascade (SCA).
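As a rough illustration of the four ablation settings and of the element-wise fusion used by the baseline, consider the following Python sketch. The names `hadamard_fuse` and `ABLATION_SETTINGS` are hypothetical, and the sketch assumes that each modality has already been reduced to a single feature vector of the same dimension.

```python
def hadamard_fuse(image_vec, question_vec):
    """Element-wise (Hadamard) product of two equal-length feature vectors."""
    if len(image_vec) != len(question_vec):
        raise ValueError("feature vectors must share the same dimension")
    return [v * q for v, q in zip(image_vec, question_vec)]


# The four rows of the ablation tables, expressed as which attention
# types are enabled in each setting.
ABLATION_SETTINGS = {
    "No attention": {"self_attention": False, "co_attention": False},
    "SA only": {"self_attention": True, "co_attention": False},
    "CA only": {"self_attention": False, "co_attention": True},
    "SCA": {"self_attention": True, "co_attention": True},
}

# Toy vectors only; these are not features from the paper.
print(hadamard_fuse([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))
```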

As expected, the model without any attention mechanism provides the lowest performance (first row). The “SA only” model still performs comparatively poorly (second row), as it lacks interaction between the two modalities and learns a weaker representation. Co-attention is the crucial component for multi-modal interaction and is found to perform better than self-attention. In terms of computational complexity, the simple fusion-based model uses the fewest parameters, while the proposed model (SCA) requires the most. However, the performance improvement, especially on the VQA2.0 dataset, justifies the additional complexity. We observe that the change in model performance is similar for both datasets in this analysis.

Figure 9 shows the model’s performance with the various attention mechanisms for the different question categories of the VQA2.0 dataset. The following is observed for the ‘Number’ category of questions. With the SA only and CA only blocks, the respective models show overall performances of $65\%$ and $73\%$. The models using SA and CA individually predict $7\%$ of samples correctly that are not correctly classified by any of the other models. Similarly, the model using the SCA block classifies $12\%$ of samples correctly that are not correctly classified by either the SA only or the CA only model. Thus, the model using SCA blocks achieves the best performance. The same pattern is observed for the other question types, i.e., ‘Yes/No’ and ‘Other’. The detailed results for all question types are shown in Figure 9.
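The overlap analysis of correct predictions amounts to simple set arithmetic, sketched below in Python. The function name `exclusive_correct` and the sample identifiers are assumptions for illustration; the percentages reported above come from the paper's experiments, not from this code.

```python
def exclusive_correct(correct_by_model, total):
    """For each model, the percentage of all samples that only this
    model answers correctly (no other model in the comparison does)."""
    result = {}
    for name, correct in correct_by_model.items():
        # Union of the correct sets of every other model.
        others = set().union(
            *(c for other, c in correct_by_model.items() if other != name)
        )
        result[name] = 100.0 * len(correct - others) / total
    return result


# Toy sample ids (assumption: made-up data, 10 samples in total).
correct = {
    "SA only": {1, 2, 3},
    "CA only": {2, 3, 4, 5},
    "SCA": {2, 3, 4, 6, 7},
}
print(exclusive_correct(correct, total=10))
```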

5.4 Qualitative Results

The qualitative results are presented in Figure 10 to demonstrate the efficacy of the proposed model. For each example, the two salient regions of the image with the highest attention scores are highlighted. These are the attention scores obtained after cascading $T=4$ SCA blocks. The question words that obtain the highest attention scores are also highlighted. As is evident from Figure 10(a), the proposed CSCA model is able to focus on relevant image regions and question words. The top-2 salient regions for the binary question “Are there any cows in the picture?” are the ones that contain the cows, and hence the model answers ‘Yes’. Similarly, Figures 10(b) through 10(h) show the model identifying the salient image regions and relevant question words to predict the appropriate answer.
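Selecting the highest-scoring regions or words for visualization is a top-k selection over attention scores. The Python sketch below shows one way to do it; the helper names `top_k_indices` and `highlight_words` and the scores are hypothetical, and how the final attention scores are extracted from the model is not specified here.

```python
def top_k_indices(scores, k=2):
    """Indices of the k highest scores, highest first (ties keep input order)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


def highlight_words(words, scores, k=2):
    """Wrap the k highest-scoring words in brackets, keeping the word order."""
    keep = set(top_k_indices(scores, k))
    return " ".join(f"[{w}]" if i in keep else w for i, w in enumerate(words))


# Made-up attention scores over four image regions and four question words.
region_scores = [0.10, 0.50, 0.05, 0.35]
print(top_k_indices(region_scores, k=2))
print(highlight_words(["are", "there", "any", "cows"], [0.1, 0.2, 0.1, 0.6]))
```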

However, the model makes errors as well. One reason is incorrect attention to image regions. As shown in Figure 11(a), the model focuses primarily on a region from which the room appears to be a kitchen. If attention were given to other regions, the answer would likely change to ‘living room’. In Figure 11(b), for the question ‘What color is the wall in back of the desk?’, the model focuses on the wall to the side of the desk instead of the one behind it. The predicted answer is ‘green’, the color of the wall beside the desk.

6 Conclusion

This work proposes a VQA model based on a dense attention mechanism. Dense attention is incorporated by exploiting both self-attention and co-attention. The self-attention mechanism helps in obtaining an improved representation within a single modality. With self-attention, a salient region (in the case of an image) interacts with every other region, so the final representation inherits contextual information from all regions. Similarly, for the input question, self-attention provides a representation of every word that also captures contextual information from the other words. The proposed model also exploits the cross-modal interaction of the two modalities, which is further strengthened by the self-attention within each modality. Attention blocks are cascaded multiple times to obtain refined visual and textual cues. The model’s capability is supported by detailed experiments and analysis performed on two benchmark VQA datasets.

The proposed method can be extended in several ways. The present proposal may be subjected to bias and consistency analysis. For example, this may be performed by rephrasing questions and flipping (or rotating) associated images. Also, the current proposal can be extended with translated question-answer pairs to validate its applicability in multi-lingual VQA.

Bibliography

[1] Anderson, P., He, X., Buehler, C., Teney, D., Johnson, M., Gould, S., and Zhang, L. Bottom-up and Top-down Attention for Image Captioning and Visual Question Answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2018), pp. 6077–6086.
[2] Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Lawrence Zitnick, C., and Parikh, D. VQA: Visual Question Answering. In Proceedings of the IEEE International Conference on Computer Vision (2015), pp. 2425–2433.
[3] Bahdanau, D., Cho, K., and Bengio, Y. Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations, ICLR 2015 (2015).
[4] Ben-Younes, H., Cadene, R., Cord, M., and Thome, N. MUTAN: Multimodal Tucker Fusion for Visual Question Answering. In Proceedings of the IEEE International Conference on Computer Vision (2017), pp. 2612–2620.
[5] Ben-Younes, H., Cadene, R., Thome, N., and Cord, M. BLOCK: Bilinear Superdiagonal Fusion for Visual Question Answering and Visual Relationship Detection. In Proceedings of the AAAI Conference on Artificial Intelligence (2019), vol. 33, pp. 8102–8109.
[6] Cadene, R., Ben-Younes, H., Cord, M., and Thome, N. MUREL: Multimodal Relational Reasoning for Visual Question Answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2019), pp. 1989–1998.
[7] Do, T., Do, T.-T., Tran, H., Tjiputra, E., and Tran, Q. D. Compact Trilinear Interaction for Visual Question Answering. In Proceedings of the IEEE/CVF International Conference on Computer Vision (2019), pp. 392–401.
[8] Fukui, A., Park, D. H., Yang, D., Rohrbach, A., Darrell, T., and Rohrbach, M. Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (Austin, Texas, Nov. 2016), pp. 457–468.