Achieving Human Parity on Visual Question Answering

Ming Yan; Haiyang Xu; Chenliang Li; Junfeng Tian; Bin Bi; Wei Wang; Weihua Chen; Xianzhe Xu; Fan Wang; Zheng Cao; Zhicheng Zhang; Qiyu Zhang; Ji Zhang; Songfang Huang; Fei Huang; Luo Si; Rong Jin

arXiv:2111.08896 · cs.CL · November 22, 2021 · 1 citation

Open Access

TL;DR

This paper presents AliceMind-MMU, a VQA system that achieves human-level performance by enhancing visual and textual features, cross-modal attention, and specialized knowledge modules, validated through extensive experiments.

Contribution

The paper introduces a novel VQA pipeline with comprehensive pre-training, advanced cross-modal attention, and expert modules, reaching or surpassing human performance.

Findings

01. Achieves performance comparable to or better than humans on VQA tasks.

02. Demonstrates the effectiveness of specialized expert modules for complex visual questions.

03. Validates the approach through extensive experiments and analysis.

Abstract

The Visual Question Answering (VQA) task uses both image and language analysis to answer a textual question about an image. It has been a popular research topic over the last decade, with an increasing number of real-world applications. This paper describes our recent research on AliceMind-MMU (ALIbaba's Collection of Encoder-decoders from Machine IntelligeNce lab of Damo academy - MultiMedia Understanding), which obtains results similar to or even slightly better than those of humans on VQA. This is achieved by systematically improving the VQA pipeline, including: (1) pre-training with comprehensive visual and textual feature representation; (2) effective cross-modal interaction with learning to attend; and (3) a novel knowledge mining framework with specialized expert modules for the complex VQA task. Treating different types of visual questions with corresponding…

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning