Select, Substitute, Search: A New Benchmark for Knowledge-Augmented Visual Question Answering

Aman Jain; Mayank Kothyari; Vishwajeet Kumar; Preethi Jyothi; Ganesh Ramakrishnan; Soumen Chakrabarti

arXiv:2103.05568·cs.CV·August 11, 2021

TL;DR

This paper introduces a new benchmark and dataset for knowledge-augmented visual question answering built around the select, substitute, and search (S3) idiom. It addresses limitations of existing datasets and makes model reasoning easier to interpret.

Contribution

The authors propose a novel dataset and challenge based on the S3 idiom, along with a transparent neural system that outperforms existing baselines in this setting.

Findings

1. The new dataset emphasizes the S3 idiom for better reasoning assessment.
2. The proposed system outperforms recent baselines on the S3 challenge.
3. The benchmark improves interpretability and robustness of knowledge-augmented VQA models.

Abstract

Multimodal IR, spanning text corpus, knowledge graph and images, called outside knowledge visual question answering (OKVQA), is of much recent interest. However, the popular data set has serious limitations. A surprisingly large fraction of queries do not assess the ability to integrate cross-modal information. Instead, some are independent of the image, some depend on speculation, some require OCR or are otherwise answerable from the image alone. To add to the above limitations, frequency-based guessing is very effective because of (unintended) widespread answer overlaps between the train and test folds. Overall, it is hard to determine when state-of-the-art systems exploit these weaknesses rather than really infer the answers, because they are opaque and their 'reasoning' process is uninterpretable. An equally important limitation is that the dataset is designed for the quantitative…
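The abstract's point about frequency-based guessing can be illustrated with a minimal sketch. It measures how many test answers already appear in the train fold and how well a "always predict the most common train answer" baseline would score. The function names and the toy answer lists are hypothetical and not taken from the paper or from OKVQA; the baselines the authors refer to may be more refined (for example, conditioned on question type).

```python
from collections import Counter


def answer_overlap(train_answers, test_answers):
    """Fraction of test answers that also occur among the train answers."""
    seen = set(train_answers)
    return sum(a in seen for a in test_answers) / len(test_answers)


def most_frequent_answer_accuracy(train_answers, test_answers):
    """Accuracy of always predicting the single most common train answer."""
    guess, _ = Counter(train_answers).most_common(1)[0]
    return sum(a == guess for a in test_answers) / len(test_answers)


# Toy, made-up answers (an assumption for illustration only).
train = ["dog", "cat", "dog", "pizza", "dog"]
test = ["dog", "pizza", "horse", "dog"]

overlap = answer_overlap(train, test)
accuracy = most_frequent_answer_accuracy(train, test)
print(f"answer overlap: {overlap:.2f}, majority-guess accuracy: {accuracy:.2f}")
```

On a real split, a high overlap together with a non-trivial majority-guess accuracy would suggest that a model can score well without looking at the image or consulting outside knowledge.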

Code & Models

Repositories

s3vqa/s3vqa.github.io (official)
