VQA with Cascade of Self- and Co-Attention Blocks
Aakansha Mishra, Ashish Anand, Prithwijit Guha

TL;DR
This paper introduces a novel VQA model that employs a cascade of self- and co-attention blocks to enhance multi-modal representation through dense visual-textual interactions, improving performance on standard datasets.
Contribution
It proposes a cascade architecture combining self- and co-attention modules for better multi-modal feature learning in VQA tasks.
Findings
Improved accuracy on VQA2.0 and TDIUC datasets.
Key components validated through ablation studies.
Cascading attention modules enhances multi-modal interaction.
| Question Type | SAN [36] | RAU [14] | MCB [11] | QTA [28] | BAN [16] | CSCA |
|---|---|---|---|---|---|---|
| Scene Recognition | 92.3 | 93.96 | 93.06 | 93.80 | 93.1 | 94.48 |
| Sport Recognition | 95.5 | 93.47 | 92.77 | 95.55 | 95.7 | 95.85 |
| Color Attributes | 60.9 | 66.86 | 68.54 | 60.16 | 67.5 | 75.51 |
| Other Attributes | 46.2 | 56.49 | 56.72 | 54.36 | 53.2 | 60.89 |
| Activity Recognition | 51.40 | 51.60 | 52.35 | 60.10 | 54.0 | 61.00 |
| Positional Reasoning | 27.9 | 35.26 | 35.40 | 34.71 | 27.9 | 42.14 |
| Object Recognition | 87.50 | 86.11 | 85.54 | 86.98 | 87.5 | 89.11 |
| Absurd | 93.4 | 96.08 | 84.82 | 100.0 | 94.47 | 97.28 |
| Utility & Affordance | 26.3 | 31.58 | 35.09 | 31.48 | 24.0 | 40.35 |
| Object Presence | 92.4 | 94.38 | 93.64 | 94.55 | 95.1 | 96.34 |
| Counting | 52.1 | 48.43 | 51.01 | 53.25 | 53.9 | 60.70 |
| Sentiment Und. | 53.6 | 60.09 | 66.25 | 64.38 | 58.7 | 67.19 |
| Overall Accuracy | 82.0 | 84.26 | 81.86 | 85.03 | 85.5 | 88.12 |
| Harmonic Mean | 53.7 | 59.00 | 60.47 | 60.08 | 54.9 | 67.05 |
| Arithmetic Mean | 65.0 | 67.81 | 67.90 | 69.11 | 67.4 | 73.34 |
| Methods | Val Overall | Test-Dev Yes / No | Test-Dev Number | Test-Dev Other | Test-Dev Overall | Test-Std Overall |
|---|---|---|---|---|---|---|
| MCB [8] | 59.14 | 78.46 | 38.28 | 57.80 | 62.27 | 53.36 |
| MLB [17] | 62.98 | 83.58 | 44.92 | 56.34 | 66.27 | 66.62 |
| MUTAN [4] | 62.71 | 82.88 | 44.54 | 56.50 | 66.01 | 66.38 |
| MFH [40] | 62.98 | 84.27 | 49.56 | 59.89 | 68.76 | – |
| BLOCK [5] | 64.91 | 83.14 | 51.62 | 58.97 | 68.09 | 68.41 |
| SAN [36] | 61.70 | 78.40 | 40.71 | 54.36 | 61.70 | – |
| BTUP [1] | 63.20 | 81.82 | 44.21 | 56.05 | 65.32 | 65.67 |
| BAN [16] | 65.81 | 82.16 | 45.45 | 55.70 | 64.30 | – |
| v-VRANet [37] | – | 83.31 | 45.51 | 58.41 | 67.20 | 67.34 |
| ALMA [19] | – | 84.62 | 47.08 | 58.24 | 68.12 | 66.62 |
| ODA [41] | 64.23 | 83.73 | 47.02 | 56.57 | 66.67 | 66.87 |
| BAN2-CTI [7] | 66.00 | – | – | – | – | 67.4 |
| CRANet [25] | – | 83.31 | 45.51 | 58.41 | 67.20 | 67.34 |
| CoR [34] | 65.14 | 84.98 | 47.19 | 58.64 | 68.19 | 68.59 |
| MUREL [6] | 65.14 | 84.77 | 49.84 | 57.85 | 68.03 | 68.41 |
| DFAF [9] | 66.66 | 86.09 | 53.32 | 60.49 | 70.22 | 70.34 |
| MLIN [10] | 66.53 | 85.96 | 52.93 | 60.40 | 70.18 | 70.28 |
| LXMERT [32] | – | – | – | – | – | 72.5 |
| ViLBERT [20] | – | – | – | – | 70.55 | 70.92 |
| CSCA | 67.36 | 86.57 | 53.58 | 61.06 | 70.72 | 71.04 |
| SA | CA | Yes / No Accuracy | Number Accuracy | Other Accuracy | Overall Accuracy | Parameters (in Millions) |
|---|---|---|---|---|---|---|
| ✗ | ✗ | 69.95 | 36.42 | 50.19 | 55.80 | 22 |
| ✓ | ✗ | 79.08 | 40.75 | 49.96 | 59.69 | 15 |
| ✗ | ✓ | 81.17 | 44.63 | 56.34 | 64.13 | 25 |
| ✓ | ✓ | 84.92 | 49.51 | 58.71 | 67.36 | 42 |
| SA | CA | Overall Accuracy | Parameters (in Millions) |
|---|---|---|---|
| ✗ | ✗ | 69.18 | 7 |
| ✗ | ✓ | 70.46 | 21 |
| ✓ | ✗ | 87.42 | 25 |
| ✓ | ✓ | 88.12 | 36 |
Abstract
The use of complex attention modules has improved the performance of the Visual Question Answering (VQA) task. This work aims to learn an improved multi-modal representation through dense interaction of visual and textual modalities. The proposed model has an attention block containing both self-attention and co-attention on image and text. The self-attention modules provide the contextual information of objects (for an image) and words (for a question) that are crucial for inferring an answer. On the other hand, co-attention aids the interaction of image and text. Further, fine-grained information is obtained from two modalities by using a Cascade of Self- and Co-Attention blocks (CSCA). This proposal is benchmarked on the widely used VQA2.0 and TDIUC datasets. The efficacy of key components of the model and cascading of attention modules are demonstrated by experiments involving ablation analysis.
Keywords: Visual Question Answering, Attention Networks, Self-Attention, Co-attention, Multi-modal Fusion, Classification Networks
1 Introduction
Initial attention-based approaches [35][15][36][1] focused on identifying salient image regions based on the text of a given question. In other words, the focus was on giving attention to images (visual attention) only. Subsequent methods, often referred to as co-attention-based methods [21][4], combined textual attention along with image attention. Textual attention focuses on relevant words in the context of the given image. Co-attention-based methods improved the performance of VQA systems. A few studies [36][22][9][10] have shown that considering attention in a cascaded or stack-based manner helps in obtaining enriched representation with fine-grained information.
Recent attention-based models have taken inspiration from transformer-based models [33] to include self-attention (SA) as well. SA incorporates the internal correlation within a modality. For the text modality, SA encodes the internal correlation among words to obtain an informative representation of the given sentence. Similarly, for the image modality, SA helps in encoding the correlation among the salient regions of the image. Figure 1 shows an illustrative example. The given question is “What color is the women’s shirt?”. The salient regions within the image include the ‘woman’. It is likely to be informative if the region containing the ‘woman’ retains contextual information such as “dress she is wearing” and “hair color”, as well as its correlation with other salient objects. Here, the woman’s shirt could be one of the regions most correlated with the other salient objects. SA helps in encoding such information.
Based on the advantages of each of the following modules: SA, co-attention (CA) and cascade of attention mechanisms, this work proposes combining them in a systematic manner. Towards this objective, the proposed model builds a self- and co-attention based attention block (SCA) that combines SA and CA in a specific way. For each of the text and image modalities, a dedicated SA module obtains a feature representation for the respective modality. Then the co-attention module uses the self-attended representation of one modality and attends to the self-attended representation of the other modality to obtain a cross-modality contextual representation for the second modality. Thus, there are two SA modules (one each for the text and image modalities) and two co-attention modules within a single SCA block (Figure 2). Within one SCA block, each modality guides itself to capture its internal correlation, and the two modalities guide each other to learn robust representations of the visual and textual domains.
The proposed model exploits the niche attributes of the different attention mechanisms and further combines them together in a dense attention module (SCA block). A Cascade of multiple SCA blocks (CSCA) is used to extract fine-grained information. Figure 2 gives the overview of the SCA block which takes representations of question and image of the block as input, and provides the improved representation of question and image.
To analyse and evaluate the model performance, extensive experiments are performed on two widely used VQA datasets: VQA2.0 [12] and TDIUC [14]. Ablation analysis experiments are also performed to understand the impact of the important components of the proposed model. Primary contributions of this work are:
- A dense attention based VQA model comprising cascaded attention blocks.
- The core of each attention block consists of self-attention and co-attention so that the two modalities guide each other to obtain an enriched representation.
- Extensive performance evaluation along with ablation analysis of the proposed model on the two benchmark datasets: TDIUC and VQA2.0.
2 Related Work
VQA, being a multimodal task, requires a unified representation of the text and image modalities. Initial VQA models [2, 12, 31, 13] adopted simple fusion based approaches. These models first obtained feature representations of each modality using corresponding pre-trained networks and then combined them to obtain a joint representation using a fusion scheme. Simple fusion schemes include concatenation, element-wise summation and element-wise multiplication. Fukui et al. [8] proposed bilinear pooling to better capture the interaction of components of the two modalities. Given the advantage of bilinear pooling based fusion methods, further variants with lower complexity or faster convergence were proposed. MFB [39], MLB [17] and MFH [40] were proposed to obtain a representation providing better interaction of the two modalities.
Introduction of attention mechanism in [3] equipped neural models with a systematic procedure to assign relative weights of importance to sequential inputs. Shi et al. in [29] have introduced image attention guided by question to focus on salient image regions relevant to the given question. This helped in obtaining improved feature representations. This led to the development of several attention based approaches for VQA [15][16][36][34][22][35][1]. Studies in [36][34][22] have shown that applying attention multiple times helps in obtaining enriched representation embedded with fine-grained information.
Authors in [21][39] have proposed that attention on textual features in the context of visual features, along with visual attention, plays a key role in VQA models. Such a two-way attention mechanism is referred to as dual attention, co-attention or cross-modality attention in the literature. We also use these terms interchangeably. Kim et al. [16] have proposed bilinear interaction based attention for dual modality. Do et al. [7] have proposed an approach that exploits knowledge distillation with a teacher and a student model. Mishra et al. [22] have proposed a co-attention based multistage model for VQA. In another work, the authors [23] have proposed question categorization and dual attention for VQA. RAMEN [30] is a unified model that uses high level reasoning and can deal with VQA datasets based on both real-world and synthetic images.
Another class of attention mechanisms uses intra-modal attention (self-attention) along with cross-modal attention (co-attention) to learn better feature representations. Gao et al. [9] have proposed DFAF, which combines self-attention and co-attention. Multi-modal Latent Interaction (MLIN) [10] used multi-modal reasoning through summarization, interaction, and aggregation. Yu et al. [38] have proposed an encoder-decoder based dense attention mechanism. These models are relatively denser than the previous approaches and hence are referred to as dense attention based models. Authors in [20][32] have proposed transformer based attention models for multi-modal tasks. These models are pretrained for multiple tasks on huge datasets and can be further exploited for downstream tasks.
The proposed model falls in the category of dense attention based methods. It uses a cascade of attention blocks to obtain a multi-modal feature representation. Here, each attention block comprises intra-modality and cross-modality interactions. The proposed method is described next.
3 Proposed Method
The proposed framework treats VQA as an answer classification task following existing works like [1][9][2][12][10]. The input image and the associated natural language question are first subjected to feature extraction (Subsection 3.1). Pretrained deep networks are used to extract features from a few salient image regions. The network embeddings are used to represent the input image. Similarly, a pretrained network is used to obtain the word embeddings of the associated input question. These word embeddings collectively represent the input question. The feature embeddings of both image and text modalities are subjected to a self-attention mechanism (Subsection 3.2) for capturing the relationships among different regions of the image and words of the question. The self-attended representations of these two modalities are further processed by co-attention modules (Subsection 3.3). This single stage of self- and co-attention forms a single SCA block (Figure 2). Multiple SCA blocks are cascaded to obtain further fine-grained representations of both modalities. The embeddings obtained from the final SCA block are fused (Subsection 3.4) and fed to the answer classification network (Subsection 3.5) to predict the answer.
3.1 Feature Extraction
A pretrained deep network based object detection model (Faster R-CNN, [27]) is used to identify the top-$k$ salient regions from the input image. The pretrained ResNet-101 [13] network is used to compute the visual feature of each region as an embedding $r_i \in \mathbb{R}^{d_r}$. Thus, the input image is represented as $R$ by using $k$ number of $d_r$ dimensional ResNet-101 embeddings.
$$R = [r_1, r_2, \ldots, r_k]^{\top} \in \mathbb{R}^{k \times d_r}$$
The input natural language question is first padded or trimmed to a length of $n$ words. The word features are then extracted as pretrained GloVe embeddings [26], $e_j \in \mathbb{R}^{d_e}$. Thus, the question is represented as $E$ by using $n$ number of $d_e$ dimensional embeddings.
$$E = [e_1, e_2, \ldots, e_n]^{\top} \in \mathbb{R}^{n \times d_e}$$
All feature embeddings in $R$ and $E$ are projected to a common $d$ dimensional space to obtain the respective initial feature embedding matrices as $X_v^{(0)} \in \mathbb{R}^{k \times d}$ and $X_q^{(0)} \in \mathbb{R}^{n \times d}$.
$$X_v^{(0)} = R\,W_r, \qquad X_q^{(0)} = E\,W_e$$
Here, $W_r \in \mathbb{R}^{d_r \times d}$ and $W_e \in \mathbb{R}^{d_e \times d}$ are the transformation matrices. These representations are provided as input to the self- and co-attention modules.
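The padding, trimming and projection steps of this subsection can be sketched as follows. This is only a minimal illustration under stated assumptions: the sizes (`k`, `n`, `d_r`, `d_e`, `d`), the `<pad>` token and the random matrices standing in for detector features, GloVe vectors and learned weights are all hypothetical and are not the paper's settings.

```python
import numpy as np

def pad_or_trim(tokens, length, pad_token="<pad>"):
    """Return exactly `length` tokens by trimming or right-padding."""
    return tokens[:length] + [pad_token] * max(0, length - len(tokens))

rng = np.random.default_rng(0)

# Hypothetical sizes chosen for illustration only.
k, d_r = 4, 8     # number of image regions, region feature size
n, d_e = 6, 5     # question length, word embedding size
d = 3             # common embedding size

tokens = pad_or_trim("what color is the shirt".split(), n)

# Random stand-ins for region features and pretrained word embeddings.
R = rng.standard_normal((k, d_r))
E = rng.standard_normal((n, d_e))

# Random stand-ins for the learned transformation matrices.
W_r = rng.standard_normal((d_r, d))
W_e = rng.standard_normal((d_e, d))

X_v0 = R @ W_r   # (k, d) initial visual embeddings
X_q0 = E @ W_e   # (n, d) initial question embeddings
```

After the projection, both modalities live in the same `d`-dimensional space, which is what allows the later attention and fusion steps to combine them directly.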
3.2 Self-Attention
The self-attention (SA) mechanism is one of the key components of the proposed model. It is incorporated for both textual (question as a collection of words) and visual (image as top-$k$ salient regions) modalities. At the $t$-th block, the inputs to SA are the visual and question embeddings from the previous block, $X_v^{(t-1)}$ and $X_q^{(t-1)}$. Following [33], the SA uses keys and queries, both of dimension $d_k$, and values of dimension $d_v$. Multi-Head Attention [33] is incorporated to capture the attention from different aspects. For this, $h$ parallel heads are added, where each head is considered to learn the relationships from a different view (for image) and context (for question).
Let $X \in \mathbb{R}^{m \times d}$ be a matrix of feature embeddings, where $X \in \{X_v^{(t-1)}, X_q^{(t-1)}\}$ and $m \in \{k, n\}$. For visual features, $X = X_v^{(t-1)}$ and $m = k$ (the number of image regions). Similarly, for question features, $X = X_q^{(t-1)}$ and $m = n$ (the question length).
The query ($Q_i$), key ($K_i$) and value ($V_i$) matrices for the $i$-th head can be respectively expressed as follows
$$Q_i = X\,W_i^{Q}, \qquad K_i = X\,W_i^{K}, \qquad V_i = X\,W_i^{V}$$
where $W_i^{Q} \in \mathbb{R}^{d \times d_k}$, $W_i^{K} \in \mathbb{R}^{d \times d_k}$ and $W_i^{V} \in \mathbb{R}^{d \times d_v}$ are transformation matrices. Using $Q_i$ and $K_i$, the inner product of each query is performed with all the keys and is divided by $\sqrt{d_k}$ for more stable gradients [33]. The SoftMax function is applied on the inner product to obtain the attention weights for question words and image salient regions. A scaled inner product based attention is computed for all the heads in the following manner.
$$\mathrm{head}_i = \mathrm{SoftMax}\!\left(\frac{Q_i K_i^{\top}}{\sqrt{d_k}}\right) V_i \tag{8}$$
$$M = [\mathrm{head}_1, \mathrm{head}_2, \ldots, \mathrm{head}_h]\, W^{O} \tag{9}$$
Here, $W^{O} \in \mathbb{R}^{h d_v \times d}$ is the transformation matrix. The output ($M$) of the multi-head attention module is passed through fully connected feed-forward layers with ReLU activation and dropout to prevent overfitting. Further, residual connections [13] followed by layer normalization are applied on top of the fully connected layers for faster and more accurate training. The layer normalization is applied over the embedding dimension only. Finally, the self-attended embeddings of the input feature $X$ are obtained as $\tilde{X} \in \mathbb{R}^{m \times d}$, where $\tilde{X} \in \{\tilde{X}_v^{(t)}, \tilde{X}_q^{(t)}\}$ and $m \in \{k, n\}$. The multi-head attention mechanism is shown in Figure 4.
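The self-attention computation described in this subsection can be illustrated with a small NumPy sketch. It follows the standard multi-head formulation of [33]; the exact placement of the residual connections and layer normalization mirrors the usual transformer encoder and is an assumption here, dropout is omitted, and all sizes and random weights are hypothetical.

```python
import numpy as np

def softmax(x):
    """Row-wise softmax with max-subtraction for numerical stability."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-6):
    """Normalize each row over the embedding dimension (no learned gain or bias)."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def multi_head_self_attention(X, heads, W_o):
    """heads: list of (W_q, W_k, W_v) tuples, one per head."""
    outputs = []
    for W_q, W_k, W_v in heads:
        Q, K, V = X @ W_q, X @ W_k, X @ W_v
        d_k = K.shape[-1]
        weights = softmax(Q @ K.T / np.sqrt(d_k))   # (m, m) attention weights
        outputs.append(weights @ V)                 # (m, d_v) per-head output
    return np.concatenate(outputs, axis=-1) @ W_o   # (m, d)

def self_attention_layer(X, heads, W_o, W_1, W_2):
    """Attention, then a ReLU feed-forward net, each with residual + layer norm."""
    A = layer_norm(X + multi_head_self_attention(X, heads, W_o))
    F = np.maximum(0.0, A @ W_1) @ W_2
    return layer_norm(A + F)

rng = np.random.default_rng(0)
m, d, h = 5, 8, 2          # hypothetical: 5 regions or words, width 8, 2 heads
d_k = d_v = d // h
heads = [tuple(rng.standard_normal((d, d_k)) for _ in range(3)) for _ in range(h)]
W_o = rng.standard_normal((h * d_v, d))
W_1 = rng.standard_normal((d, 4 * d))
W_2 = rng.standard_normal((4 * d, d))

X = rng.standard_normal((m, d))
X_tilde = self_attention_layer(X, heads, W_o, W_1, W_2)
```

The same function would be applied once to the image embeddings and once to the question embeddings, with separate weights for each modality.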
3.3 Co-Attention
For cross-modal interactions, the co-attention module takes the representations of the two modalities and generates attention for each in the context of the other. To facilitate this, the self-attended embeddings $\tilde{X}_v^{(t)}$ (image) and $\tilde{X}_q^{(t)}$ (question) are taken as input. For generating image attention in the context of question words, the keys and values are generated from the self-attended intermediate question representation while the query is obtained from the image itself (following Equation 8). Thus, the query ($Q_i$), key ($K_i$) and value ($V_i$) for the $i$-th head are respectively computed as follows.
$$Q_i = \tilde{X}_v^{(t)}\,U_i^{Q}, \qquad K_i = \tilde{X}_q^{(t)}\,U_i^{K}, \qquad V_i = \tilde{X}_q^{(t)}\,U_i^{V}$$
Here, $U_i^{Q}$, $U_i^{K}$ and $U_i^{V}$ are transformation matrices. Similarly, for cross-modal question attention, the query is obtained from the self-attended question embeddings, while the keys and values are obtained from the self-attended image embeddings. These queries, keys and values are processed following Equations 8 and 9 to obtain the multi-head attention. This is fed to fully connected layers with ReLU, dropout, skip connections and layer normalization. The output of this network provides the final output of the co-attention module. Figure 5 gives an overview of the self-attention and co-attention mechanisms.
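The asymmetry between the two co-attention directions is easiest to see in code. The sketch below is a single-head simplification (the paper uses multiple heads plus feed-forward layers); the function name `cross_attention`, the sizes and the random weights are hypothetical.

```python
import numpy as np

def softmax(x):
    """Row-wise softmax with max-subtraction for numerical stability."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_attention(X_query, X_context, W_q, W_k, W_v):
    """Queries come from one modality; keys and values come from the other."""
    Q = X_query @ W_q
    K = X_context @ W_k
    V = X_context @ W_v
    weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]))  # (len_query, len_context)
    return weights @ V, weights

rng = np.random.default_rng(0)
k, n, d = 4, 6, 8                       # hypothetical sizes
X_v = rng.standard_normal((k, d))       # stand-in for self-attended image regions
X_q = rng.standard_normal((n, d))       # stand-in for self-attended question words

def random_weights():
    """One (W_q, W_k, W_v) triple of random stand-in matrices."""
    return [rng.standard_normal((d, d)) for _ in range(3)]

# Image attention in the context of the question: the image supplies the queries.
img_out, img_w = cross_attention(X_v, X_q, *random_weights())
# Question attention in the context of the image: the question supplies the queries.
qst_out, qst_w = cross_attention(X_q, X_v, *random_weights())
```

Each output keeps the length of the modality that supplied the queries, so the image branch still has one row per region and the question branch one row per word.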
3.4 Cascading & Fusion
A single SCA block comprising self-attention (intra-modality interaction) and co-attention (inter-modality interaction) generates an enriched representation ($X_v^{(t)}$, $X_q^{(t)}$) of its input visual and textual features.
Existing works [36][22] suggest the stacking of multiple such blocks to obtain further fine-grained representations. This is accomplished by cascading $T$ SCA blocks. Let $X_v^{(T)} \in \mathbb{R}^{k \times d}$ and $X_q^{(T)} \in \mathbb{R}^{n \times d}$ be the respective visual and question representations obtained from the final ($T$-th) SCA block, with rows $x_{v,i}^{(T)}$ and $x_{q,j}^{(T)}$.
The feature representations are obtained by averaging the attended embeddings of the two modalities. So, the final visual embedding, say $\bar{v}$, is obtained as follows.
$$\bar{v} = \frac{1}{k} \sum_{i=1}^{k} x_{v,i}^{(T)}$$
Similarly, the question encoding, say $\bar{q}$, is evaluated in the following manner.
$$\bar{q} = \frac{1}{n} \sum_{j=1}^{n} x_{q,j}^{(T)}$$
The unified multi-modal representation, say $z$, is obtained by fusing $\bar{v}$ and $\bar{q}$ through element-wise multiplication.
$$z = \bar{v} \odot \bar{q}$$
The fused embedding $z$ is fed to a fully connected network for answer prediction.
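The cascading and fusion steps of this subsection can be sketched in a few lines. The `toy_block` below is a hypothetical placeholder for a real SCA block (it only shifts the values so that the effect of the loop is visible); the averaging and element-wise multiplication follow the description in the text.

```python
import numpy as np

def cascade(X_v, X_q, blocks):
    """Feed the outputs of each block into the next one."""
    for block in blocks:
        X_v, X_q = block(X_v, X_q)
    return X_v, X_q

def fuse(X_v, X_q):
    """Average each modality over its rows, then multiply element-wise."""
    return X_v.mean(axis=0) * X_q.mean(axis=0)

def toy_block(X_v, X_q):
    """Placeholder for an SCA block; a real block would apply SA and CA."""
    return X_v + 1.0, X_q + 1.0

X_v = np.array([[0.0, 1.0], [2.0, 3.0]])                 # 2 regions, width 2
X_q = np.array([[1.0, -1.0], [3.0, 1.0], [-1.0, 0.0]])   # 3 words, width 2

X_v_T, X_q_T = cascade(X_v, X_q, [toy_block] * 2)
z = fuse(X_v_T, X_q_T)
```

Because both modalities are averaged down to a single `d`-dimensional vector, the fused embedding has a fixed size regardless of the number of regions or words.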
3.5 Answer Prediction
The fused embedding is fed to a fully connected network with a single hidden layer. The number of labels at the output layer equals the number of candidate answer classes, and the output of this network is the predicted answer vector.
3.6 Model Learning
The model is trained with the cross-entropy loss between the ground truth answer and the predicted answer for each input image and question pair.
The combined set of parameters of the proposed model includes those of the feature extraction stage, the dense attention blocks and the fusion mechanism.
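As a rough illustration of the training objective, the sketch below computes a softmax cross-entropy for a single-label classifier. The exact form of the loss used in the paper (for example, single-label versus soft-score targets) is not recoverable from the text, so this formulation and the function names are assumptions.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target_index):
    """Negative log-probability assigned to the ground-truth class."""
    return -math.log(softmax(logits)[target_index])

def batch_loss(batch_logits, batch_targets):
    """Mean cross-entropy over a batch of (logits, target) pairs."""
    losses = [cross_entropy(l, t) for l, t in zip(batch_logits, batch_targets)]
    return sum(losses) / len(losses)
```

The loss is small when the classifier puts most of its probability mass on the ground-truth answer and grows as that probability shrinks.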
4 Experiment Design
This section discusses the datasets used to benchmark the proposed model, the three evaluation metrics and the necessary implementation details.
4.1 Dataset
The proposed model is evaluated through experiments performed on the datasets VQA2.0 [12] and TDIUC [14]. The VQA2.0 [12] dataset is widely used for the VQA task. There are three question categories in VQA2.0. These are ‘Yes/No’ (), ‘Number’ () and ‘Other’ (). The dataset is divided into train, validation and test sets with , and image, question and answer triplets respectively.
The Task-Directed Image Understanding Challenge (TDIUC) [14] is another large VQA dataset of real images. Questions are categorized into types. These are ‘Scene Recognition’ (), ‘Sport Recognition’ (), ‘Color’ (), ‘Other Attributes’ (), ‘Activity Recognition’ (), ‘Positional Reasoning’ (), ‘Object Recognition’ (), ‘Absurd’ (), ‘Utility & Affordance’ (), ‘Object Presence’ (), ‘Counting’ () and ‘Sentiment Understanding’ (). Total million question, image and answer triplets are split into train and validation sets. The train set consists of million triplets and million triplets are in the validation split. To deal with language prior issues, TDIUC consists of a special category ‘Absurd’, where an input question is not related to the visual content of a given image.
4.2 Evaluation Metrics
For evaluation of the TDIUC dataset, Arithmetic-Mean Per Type (AMPT) and Harmonic-Mean Per Type (HMPT) are proposed in [14] as fair evaluation metrics along with Overall Accuracy. The AMPT is the average of question category-wise accuracies with uniform weight to each category. On the other hand, HMPT measures the ability of the model to have a high score across all question types.
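The two per-type metrics can be recomputed from the per-category accuracies. The sketch below uses the CSCA column of the TDIUC results table in this document; the recomputed means should land close to, though not necessarily exactly on, the reported values, since the table entries are rounded.

```python
from statistics import harmonic_mean, mean

# CSCA per-type accuracies (%) copied from the TDIUC results table.
csca_accuracy = {
    "Scene Recognition": 94.48,
    "Sport Recognition": 95.85,
    "Color Attributes": 75.51,
    "Other Attributes": 60.89,
    "Activity Recognition": 61.00,
    "Positional Reasoning": 42.14,
    "Object Recognition": 89.11,
    "Absurd": 97.28,
    "Utility & Affordance": 40.35,
    "Object Presence": 96.34,
    "Counting": 60.70,
    "Sentiment Und.": 67.19,
}

values = list(csca_accuracy.values())
ampt = mean(values)            # arithmetic mean per type
hmpt = harmonic_mean(values)   # harmonic mean per type
```

The harmonic mean is pulled down by the weakest categories (here 'Utility & Affordance' and 'Positional Reasoning'), which is why it rewards models that score well across all question types.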
The VQA2.0 dataset evaluation is performed using the following metric defined in [2].
$$\mathrm{Acc}(ans) = \min\!\left(\frac{\#\,\text{humans that provided } ans}{3},\ 1\right)$$
Each question in the VQA2.0 dataset was answered by 10 annotators. The above evaluation metric considers a predicted answer correct if it matches the answers given by at least 3 annotators.
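A simplified version of this metric, assuming the standard definition from [2] (ten annotators per question, with credit saturating at three matching answers), can be written as below. The official evaluation script additionally normalizes answer strings and averages over annotator subsets, which this sketch leaves out; the example answers are hypothetical.

```python
def vqa_accuracy(predicted, human_answers):
    """min(number of annotators who gave `predicted` / 3, 1)."""
    matches = sum(1 for answer in human_answers if answer == predicted)
    return min(matches / 3.0, 1.0)

# Ten hypothetical annotator answers for one question.
answers = ["yes"] * 2 + ["no"] * 7 + ["maybe"]

score_yes = vqa_accuracy("yes", answers)      # 2 matching annotators
score_no = vqa_accuracy("no", answers)        # 7 matching annotators, capped at 1
score_other = vqa_accuracy("blue", answers)   # no matching annotators
```

An answer given by only one or two annotators therefore earns partial credit instead of being counted as simply wrong.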
4.3 Implementation Details
Visual feature representation is constructed by extracting (for TDIUC) and (for VQA2.0) image regions. The use of ResNet-101 embeddings provide image region features of dimensions. The question length in terms of number of tokens () is set to by trimming or padding. The GloVe word embeddings of dimensions are considered. The image and word features are projected to same dimensions . For self- and co-attention computations, the key, query and value vector dimensions are set to , i.e., . The model uses heads for multi-head attention. The model is trained for epochs with a batch size of samples for both experiments and analysis. The hidden layer dimension of answer prediction FCNet is set to . The Adamax optimizer [18] is used with a decaying step learning rate. The initial learning rate is set to , and it decays by after every epochs. The proposed model CSCA is built on the PyTorch framework and is trained on NVIDIA-GTX GPU.
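A step-decay learning rate schedule of the kind mentioned above can be sketched as follows. The initial rate, decay factor and step size used here are hypothetical placeholders rather than the paper's settings, and the sketch assumes that "decays by" means multiplication by a fixed factor. In PyTorch, a comparable setup would typically pair `torch.optim.Adamax` with `torch.optim.lr_scheduler.StepLR`.

```python
def step_decay_lr(initial_lr, decay_factor, step_size, epoch):
    """Learning rate at `epoch`, multiplied by `decay_factor` every `step_size` epochs."""
    return initial_lr * decay_factor ** (epoch // step_size)

# Hypothetical values for illustration only.
schedule = [step_decay_lr(0.002, 0.5, 2, epoch) for epoch in range(6)]
```

With these placeholder values the rate stays constant for two epochs, then halves, and halves again two epochs later.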
5 Results and Discussion
5.1 Quantitative Results
Overall Performance & Category-wise Performance Comparison on TDIUC Dataset – Table 1 and 2 present the respective class-wise and overall performance for the TDIUC dataset. In terms of the overall accuracy, Arithmetic-MPT (AMPT) and Harmonic-MPT (HMPT) measures, the proposed model CSCA exhibits better performance compared to most of the baseline methods. Also, in terms of class-wise accuracy, CSCA leads in all except one class. A significant relative gain of 12.6% is observed compared to the next best performing model for the ‘Counting’ category of questions. Table 3 presents the results for different models trained ‘Without Absurd’ category of questions. It is observed that CSCA performs better than the existing ones for all three defined evaluation metrics.
Overall Performance & Category-wise Performance Comparison on VQA2.0 Dataset – Table 4 demonstrates the results on test-dev and test-std splits of the VQA2.0 dataset. Performance of the proposed model CSCA is comparable with that of the best among the existing methods. The models LXMERT [32], ViLBERT [20] are pre-trained for multiple vision and language based tasks and are fine-tuned for VQA. Here, CSCA has obtained 67.36% accuracy on the validation set. This is around 1% improvement over the best performance among the existing methods.
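The gains quoted in this subsection can be checked against the results tables with a short calculation. The sketch assumes that the 'Counting' comparison is against BAN (the next best entry in the TDIUC table) and that the VQA2.0 validation comparison is against DFAF (the best prior validation entry in the VQA2.0 table).

```python
def relative_gain(new, old):
    """Improvement of `new` over `old`, as a percentage of `old`."""
    return 100.0 * (new - old) / old

# 'Counting' on TDIUC: CSCA (60.70) vs. BAN (53.9).
counting_gain = relative_gain(60.70, 53.9)

# VQA2.0 validation accuracy: CSCA (67.36) vs. DFAF (66.66).
val_abs_gain = 67.36 - 66.66
val_rel_gain = relative_gain(67.36, 66.66)
```

Under these assumptions the 'Counting' figure is a relative gain, while the validation improvement is about 0.7 percentage points in absolute terms.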
5.2 Basic Analysis
Effect of Training Data Size on Performance – An analysis is performed to observe the effect of the variation of training dataset size on model performance. The primary objective of this experiment was to ascertain whether a model trained on a smaller dataset can provide similar performance as the one learned from the complete set. To explore this, the model is trained with four different datasets obtained from the original VQA2.0 dataset. The first three datasets are obtained by random shuffling of all samples of the VQA2.0 dataset followed by the extraction of 25%, 50% and 75% of the samples. The fourth one is the complete VQA2.0 dataset (i.e. 100%). Other experimental settings, such as the hidden dimension and the number of answer classes, are kept the same as in the original setup for all variants of the dataset. The epoch-wise performances for the four different datasets are shown in Figure 6(a). As expected, the model performance improved with an increase in training dataset size. It can be observed that in all four settings, the model performance evolves over the epochs in a similar fashion. However, Figure 6(b) indicates that the relative gain achieved by increasing the training dataset size from 25% to 50% is significant compared to that by increasing from 50% to 75% or 75% to 100%. This observation may be attributed to the fact that in a collection of randomly shuffled datasets, not many novel instances were encountered during the subsequent increase of the training data.
Effect of Number of SCA Blocks – In one pass, it is difficult for a model to grasp all relevant information through a representation. Thus, attention blocks in cascade extract the fine-grained information and pass it on to the next one for further refinement. A set of experiments is performed to identify the optimal number of blocks in the cascade. Additionally, the effect of different independent attention mechanisms (SA only, CA only, SCA) for answer prediction is also analyzed. In Figure 7(a), the overall performance on the validation split of the VQA2.0 dataset is given with respect to a varying number of blocks. Figure 7(b) shows the parameter counts with respect to the number of blocks. As expected, the models perform poorly with a single attention block (SA only, CA only, SCA). However, the performance is observed to rise only up to four blocks. Increasing the number of blocks beyond four does not lead to any further performance improvement, while adding more blocks also leads to an increase in the number of model parameters (Figure 7(b)). Furthermore, one can observe that using only the CA module performs better than using only the SA module, as expected. Similarly, Figure 8 shows that the model performance keeps improving until the fourth SCA block for the TDIUC dataset. The model performance starts deteriorating with a further increase in the number of blocks.
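The construction of the reduced training sets can be sketched as below. Whether the paper's subsets were nested (each smaller set contained in the next larger one) is not stated, so the single-shuffle, prefix-based approach here is an assumption, as are the function name, the seed and the toy triplets.

```python
import random

def nested_subsets(samples, fractions=(0.25, 0.50, 0.75, 1.00), seed=0):
    """Shuffle once, then take a prefix of the shuffled list for each fraction."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    return {f: shuffled[: round(len(shuffled) * f)] for f in fractions}

# Hypothetical stand-in for (image, question, answer) triplets.
triplets = [("img%d" % i, "question %d" % i, "answer %d" % i) for i in range(200)]
subsets = nested_subsets(triplets)
```

Taking prefixes of one shuffle keeps the comparison controlled: each larger training set only adds samples to the smaller one, so differences in accuracy can be attributed to the added data.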
5.3 Ablation Analysis
The proposed model performs self-attention on the two modalities to obtain intra-modality correlated features. Then the co-attention module uses respective representations of the two modalities to obtain cross-modality correlated features by performing attention for one modality in the context of another. In this ablation analysis, we examine the impact of individual attention module in various combinations to understand their importance. We also analyze the set of correct predictions obtained in these settings.
Tables 5 and 6 present the results of the ablation analysis experiments in terms of performance and complexity. The complexity is expressed in terms of the number of model parameters. The first row of each table shows the model performance when neither attention mechanism is incorporated. The features for both modalities are fused directly via element-wise multiplication without applying self- or co-attention. The second row shows the performance when only self-attention (SA only) is incorporated on both modalities and answer prediction is based on the fused embedding of the self-attended representations of the individual modalities. Here, the fused representation is obtained via element-wise multiplication. The third row shows the results when only co-attention (CA only) is incorporated on the image and question, each in the context of the other. The last row shows the results from the proposed model that comprises both self-attention and co-attention in cascade (SCA).
As per expectation, the model without any attention mechanism provides the lowest performance (first row). The “SA only” model provides lower performance as it lacks the interaction of two modalities and learns a comparatively poor representation (second row). Co-attention is the crucial component for multi-modality that is found to perform better than self-attention. In terms of computational complexity, a simple fusion-based model uses the least number of parameters, while the proposed model (SCA) requires the highest number of parameters. However, the performance improvement, especially for VQA2.0 dataset, overcomes the complexity issue. We observe that the change in model performance is similar for both datasets in this analysis.
Figure 9 shows the model’s performance over various attention mechanisms for the different types of questions category on VQA2.0 dataset. The following are observed from the results for the ‘Number’ category of questions. While using the SA only and CA only blocks, the respective models show the overall performances of and . Models using SA and CA attention individually predicts of samples correctly that are not correctly classified by any of the other models. Similarly, the model using SCA block classifies of samples correctly that are not correctly classified either by the models using SA or CA only. Thus, the models using SCA blocks achieved the best performance. The same pattern was observed over the other questions types i.e., ‘Yes/No’ and ‘Other’. The detailed result for all the question types are shown in figure 9.
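The overlap analysis described above amounts to set operations on the questions each model variant answers correctly. The sketch below shows one way to compute the samples that only a single variant gets right; the function name and the question ids are hypothetical and do not reproduce the paper's numbers.

```python
def uniquely_correct(correct_ids):
    """Map each model to the questions that only it answered correctly."""
    unique = {}
    for name, ids in correct_ids.items():
        others = set()
        for other_name, other_ids in correct_ids.items():
            if other_name != name:
                others |= other_ids
        unique[name] = ids - others
    return unique

# Hypothetical question ids answered correctly by each model variant.
correct_ids = {
    "SA only": {1, 2, 3, 7},
    "CA only": {2, 3, 4, 8},
    "SCA": {2, 3, 4, 5, 6, 7},
}
unique = uniquely_correct(correct_ids)
```

Dividing the size of each unique set by the total number of questions in a category would give the kind of percentage reported for Figure 9.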
5.4 Qualitative Results
The qualitative results are presented in Figure 10 to demonstrate the efficacy of the proposed model. For this, two salient regions of a given image with the highest attention scores are highlighted. These are the attention scores obtained after cascading SCA blocks. The question words that obtain the highest attention scores are also highlighted. As evident from Figure 10(a), the proposed model CSCA is able to focus on relevant image regions and question words. The top-2 salient regions corresponding to the binary question “Are there any cows in the picture?” are the ones that capture the cows and hence, the model responds with the answer ‘Yes’. Similarly, Figures 10(b) to 10(h) show that the model tries to identify the salient image regions and relevant question words to predict the appropriate answer.
However, the model made errors as well. One of the reasons was incorrect attention to image regions. As shown in Figure 11(a), the model’s focus is primarily on the position from where it seems like this room is a kitchen. If the attention is given to other regions, the answer is likely to change to ‘living room’. In Figure 11(b), for the question ‘What color is the wall in back of the desk ?’, the model focuses on the other side of the desk instead of the back. The predicted answer is ‘green’, the color of the wall at the side of the desk.
6 Conclusion
This work proposes a dense attention mechanism-based VQA model. Dense attention is incorporated by exploiting both self-attention and co-attention. The self-attention mechanism helps in obtaining improved representation within a single modality. With self-attention, a salient region (in the case of image) interacts with every other region. The final representation inherits the contextual information for all regions. Similarly, for the input questions, self-attention provides the representation of every single word that captures the contextual information for other words as well. The proposed model also exploits the cross-modal interaction of two modalities which is further strengthened by self-attention of two modalities. Attention blocks are cascaded multiple times to facilitate refined cues of visual and textual features. The model’s capability is justified by detailed experiments and analysis performed on the two benchmark VQA datasets.
The proposed method can be extended in several ways. The present proposal may be subjected to bias and consistency analysis. For example, this may be performed by rephrasing questions and flipping (or rotating) associated images. Also, the current proposal can be extended with translated question-answer pairs to validate its applicability in multi-lingual VQA.
References
- [1] Anderson, P., He, X., Buehler, C., Teney, D., Johnson, M., Gould, S., and Zhang, L. Bottom-up and Top-down Attention for Image Captioning and Visual Question Answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2018), pp. 6077–6086.
- [2] Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Lawrence Zitnick, C., and Parikh, D. VQA: Visual Question Answering. In Proceedings of the IEEE International Conference on Computer Vision (2015), pp. 2425–2433.
- [3] Bahdanau, D., Cho, K., and Bengio, Y. Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations, ICLR 2015 (2015).
- [4] Ben-Younes, H., Cadene, R., Cord, M., and Thome, N. MUTAN: Multimodal Tucker Fusion for Visual Question Answering. In Proceedings of the IEEE International Conference on Computer Vision (2017), pp. 2612–2620.
- [5] Ben-Younes, H., Cadene, R., Thome, N., and Cord, M. BLOCK: Bilinear Superdiagonal Fusion for Visual Question Answering and Visual Relationship Detection. In Proceedings of the AAAI Conference on Artificial Intelligence (2019), vol. 33, pp. 8102–8109.
- [6] Cadene, R., Ben-Younes, H., Cord, M., and Thome, N. Murel: Multimodal relational reasoning for visual question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2019), pp. 1989–1998.
- [7] Do, T., Do, T.-T., Tran, H., Tjiputra, E., and Tran, Q. D. Compact Trilinear Interaction for Visual Question Answering. In Proceedings of the IEEE/CVF International Conference on Computer Vision (2019), pp. 392–401.
- [8] Fukui, A., Park, D. H., Yang, D., Rohrbach, A., Darrell, T., and Rohrbach, M. Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (Austin, Texas, Nov. 2016), pp. 457–468.
