Visual Question Answering based on Formal Logic
Muralikrishnna G. Sethuraman, Ali Payani, Faramarz Fekri, J. Clayton Kerce

TL;DR
This paper introduces a formal logic-based approach to visual question answering that converts images and questions into symbolic representations for explicit reasoning, achieving high accuracy and interpretability.
Contribution
It presents a framework for VQA that combines scene graphs, which supply logical background facts about the image, with a transformer-based model that translates questions into first-order logic clauses.
Findings
Achieves 99.6% accuracy on the CLEVR dataset.
The reasoning process is highly interpretable.
Data-efficient: reaches 99.1% accuracy when trained on only 10% of the training data.
Abstract
Visual question answering (VQA) has gained considerable traction in the machine learning community in recent years because of the challenges of understanding information that comes from multiple modalities (i.e., images and language). In VQA, a series of questions is posed about a set of images, and the task is to arrive at the answer. To achieve this, we take a symbolic-reasoning approach using the framework of formal logic. The image and the questions are converted into symbolic representations on which explicit reasoning is performed. We propose a formal logic framework in which (i) images are converted to logical background facts with the help of scene graphs, (ii) the questions are translated to first-order predicate logic clauses using a transformer-based deep learning model, and (iii) satisfiability checks are performed using the background knowledge and the…
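The three steps in the abstract can be illustrated with a small toy example. The sketch below is not the authors' implementation: the scene graph, the predicate names (`color`, `shape`, `left_of`), and the helpers `background_facts` and `satisfying_bindings` are all hypothetical, and the question clause is written by hand, whereas the paper obtains it from a transformer model. It only shows what "checking a first-order clause against background facts" can look like, using brute-force enumeration over objects.

```python
from itertools import product

# Hypothetical scene graph for one image: object id -> attributes.
scene = {
    "o1": {"shape": "cube", "color": "red", "size": "large"},
    "o2": {"shape": "sphere", "color": "blue", "size": "small"},
    "o3": {"shape": "cube", "color": "blue", "size": "small"},
}
# Hypothetical spatial relations as (subject, relation, object) triples.
relations = {("o1", "left_of", "o2"), ("o3", "left_of", "o2")}


def background_facts(scene, relations):
    """Flatten the scene graph into ground facts such as ('color', 'o1', 'red')."""
    facts = {
        (attr, obj, val)
        for obj, attrs in scene.items()
        for attr, val in attrs.items()
    }
    facts |= {(rel, a, b) for a, rel, b in relations}
    return facts


def satisfying_bindings(clause, variables, facts, objects):
    """Return every variable binding under which all atoms of the clause are facts."""
    results = []
    for values in product(sorted(objects), repeat=len(variables)):
        binding = dict(zip(variables, values))
        # Substitute variables; constants (e.g. 'blue') are left unchanged.
        ground = {(p, binding.get(a, a), binding.get(b, b)) for p, a, b in clause}
        if ground <= facts:
            results.append(binding)
    return results


facts = background_facts(scene, relations)

# "Is there a blue cube to the left of a sphere?", hand-translated to
# exists X, Y: color(X, blue) & shape(X, cube) & left_of(X, Y) & shape(Y, sphere)
clause = [
    ("color", "X", "blue"),
    ("shape", "X", "cube"),
    ("left_of", "X", "Y"),
    ("shape", "Y", "sphere"),
]
bindings = satisfying_bindings(clause, ["X", "Y"], facts, scene.keys())
answer = "yes" if bindings else "no"
print(answer, bindings)
```

Because the answer is derived from explicit bindings, the objects that justify it can be inspected directly, which is one way to read the interpretability claim. Exhaustive enumeration is used here only for brevity; it grows exponentially with the number of variables.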
