PAT: Parallel Attention Transformer for Visual Question Answering in Vietnamese
Nghia Hieu Nguyen, Kiet Van Nguyen

TL;DR
This paper introduces the Parallel Attention Transformer (PAT) for Vietnamese visual question answering. PAT combines a Parallel Attention mechanism with a Hierarchical Linguistic Features Extractor and reports the best accuracy among the compared models on the ViVQA benchmark dataset.
Contribution
The paper proposes the PAT model with a new Parallel Attention mechanism and a Hierarchical Linguistic Features Extractor tailored for Vietnamese VQA tasks.
Findings
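Achieved the best accuracy among all compared models on the ViVQA dataset
Outperformed all baselines as well as other SOTA methods, including SAAA and MCAN
Demonstrated the effectiveness of the two proposed modules, the Parallel Attention mechanism and the Hierarchical Linguistic Features Extractor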
Abstract
In this paper, we present a novel scheme for multimodal learning named the Parallel Attention mechanism. In addition, to take advantage of grammar and context in Vietnamese, we propose the Hierarchical Linguistic Features Extractor to extract linguistic features in place of an LSTM network. Based on these two novel modules, we introduce the Parallel Attention Transformer (PAT), which achieves the best accuracy on the benchmark ViVQA dataset, outperforming all baselines and other SOTA methods, including SAAA and MCAN.
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Text and Document Classification Technologies
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Byte Pair Encoding · Position-Wise Feed-Forward Layer · Label Smoothing · Residual Connection · Absolute Position Encodings · Adam · Layer Normalization
