LaKo: Knowledge-driven Visual Question Answering via Late Knowledge-to-Text Injection

Zhuo Chen; Yufeng Huang; Jiaoyan Chen; Yuxia Geng; Yin Fang; Jeff Pan; Ningyu Zhang; Wen Zhang

arXiv:2207.12888·cs.CV·November 29, 2022·1 cites

TL;DR

LaKo is a knowledge-driven VQA approach that converts knowledge graph triples into text and injects them late, at the knowledge fusion stage, achieving state-of-the-art results on the OKVQA dataset.

Contribution

The paper proposes a late knowledge-to-text injection mechanism that integrates an external knowledge graph into a VQA model by converting its triples into text before knowledge fusion.

Findings

01. Achieves state-of-the-art results on the OKVQA dataset.

02. Incorporates structured knowledge by transforming KG triples into text.

03. Addresses VQA as a text generation task with an encoder-decoder model.

Abstract

Visual question answering (VQA) often requires an understanding of visual concepts and language semantics, which relies on external knowledge. Most existing methods exploit pre-trained language models and/or unstructured text, but the knowledge in these resources is often incomplete and noisy. Other methods use knowledge graphs (KGs), which often contain dense structured knowledge, but this line of research is still quite preliminary. In this paper, we propose LaKo, a knowledge-driven VQA method via Late Knowledge-to-text Injection. To effectively incorporate an external KG, we transform triples into a textual format and propose a late injection mechanism for knowledge fusion. Finally, we address VQA as a text generation task with an effective encoder-decoder paradigm, which achieves state-of-the-art results on the OKVQA dataset.
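
The two ideas named in the abstract, turning KG triples into text and injecting that text late, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the snake_case relation labels, the prompt layout, the use of an image caption as the textual stand-in for the image, and the `top_k` cutoff are all assumptions made for illustration. The sketch only prepares text inputs that a generic encoder-decoder text generator could consume.

```python
from typing import List, Tuple

# (head, relation, tail); the relation labels used here are illustrative only.
Triple = Tuple[str, str, str]


def triple_to_text(triple: Triple) -> str:
    """Verbalize one KG triple as a short plain-text statement."""
    head, relation, tail = triple
    # Assumed convention: relations are snake_case labels such as "used_for".
    return f"{head} {relation.replace('_', ' ')} {tail}"


def build_generator_inputs(
    question: str,
    caption: str,
    triples: List[Triple],
    top_k: int = 2,
) -> List[str]:
    """Pair the question and an image caption with each of the first top_k facts.

    Knowledge enters only here, at the text generator's input ("late"),
    rather than during visual feature extraction. The caption argument and
    the prompt layout are hypothetical.
    """
    facts = [triple_to_text(t) for t in triples[:top_k]]
    return [
        f"question: {question} caption: {caption} knowledge: {fact}"
        for fact in facts
    ]


if __name__ == "__main__":
    # Hypothetical retrieved triples, assumed to be already ranked by relevance.
    retrieved = [
        ("umbrella", "used_for", "rain protection"),
        ("umbrella", "is_a", "canopy"),
        ("rain", "is_a", "weather"),
    ]
    for text in build_generator_inputs(
        "What is this object used for?",
        "a person holding an umbrella",
        retrieved,
    ):
        print(text)
```

Each returned string would be one input to the generator; how the model fuses several such inputs when decoding the answer is left out of this sketch because the abstract does not specify it.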

Code & Models

Repositories

hackerchenzhuo/LaKo (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Human Pose and Action Recognition