PAT: Parallel Attention Transformer for Visual Question Answering in Vietnamese

Nghia Hieu Nguyen; Kiet Van Nguyen

arXiv:2307.08247·cs.CL·July 18, 2023

TL;DR

This paper introduces the Parallel Attention Transformer (PAT) for Vietnamese visual question answering. PAT combines a Parallel Attention mechanism with a Hierarchical Linguistic Features Extractor and reports the best accuracy among the compared methods on the ViVQA benchmark.

Contribution

The paper proposes the PAT model with a new Parallel Attention mechanism and a Hierarchical Linguistic Features Extractor tailored for Vietnamese VQA tasks.

Findings

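01. Achieved the best accuracy among all compared methods on the ViVQA dataset

02. Outperformed the baselines as well as other SOTA methods, including SAAA and MCAN

03. Demonstrated the effectiveness of the two proposed modules, the Parallel Attention mechanism and the Hierarchical Linguistic Features Extractor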

Abstract

We present a novel scheme for multimodal learning named the Parallel Attention mechanism. In addition, to take advantage of grammar and context in Vietnamese, we propose the Hierarchical Linguistic Features Extractor, which extracts linguistic features in place of an LSTM network. Based on these two novel modules, we introduce the Parallel Attention Transformer (PAT), which achieves the best accuracy on the benchmark ViVQA dataset compared to all baselines and to other SOTA methods, including SAAA and MCAN.

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Text and Document Classification Technologies

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Byte Pair Encoding · Position-Wise Feed-Forward Layer · Label Smoothing · Residual Connection · Absolute Position Encodings · Adam · Layer Normalization