TL;DR
This paper introduces NLKI, a framework that improves small vision-language models on commonsense VQA tasks by feeding them retrieved facts and LLM-generated explanations, raising answer accuracy by up to 7% and reducing hallucinations in the generated explanations.
Contribution
NLKI is the first end-to-end framework combining knowledge retrieval and LLM explanations to boost small VLMs in commonsense reasoning tasks.
Findings
NLKI improves end-to-end answer accuracy by up to 7% across three datasets (CRIC, AOKVQA, e-SNLI-VE).
Noise-robust training adds a further 2.5-5.5% in accuracy.
Knowledge integration can outperform knowledge base retrieval in certain cases.
Abstract
Commonsense visual-question answering often hinges on knowledge that is missing from the image or the question. Small vision-language models (sVLMs) such as ViLT, VisualBERT and FLAVA therefore lag behind their larger generative counterparts. To study the effect of careful commonsense knowledge integration on sVLMs, we present an end-to-end framework (NLKI) that (i) retrieves natural language facts, (ii) prompts an LLM to craft natural language explanations, and (iii) feeds both signals to sVLMs respectively across two commonsense VQA datasets (CRIC, AOKVQA) and a visual-entailment dataset (e-SNLI-VE). Facts retrieved using a fine-tuned ColBERTv2 and an object information-enriched prompt yield explanations that largely cut down hallucinations, while lifting the end-to-end answer accuracy by up to 7% (across 3 datasets), making FLAVA and other models in NLKI match or exceed medium-sized…
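The three stages named in the abstract (retrieve facts, prompt an LLM for an explanation, feed both to the sVLM) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the lexical-overlap retriever is a toy stand-in for the fine-tuned ColBERTv2, the `llm` argument is any callable from prompt string to text, and all function names, the prompt wording, and the `[SEP]` joining convention are hypothetical.

```python
from typing import Callable, List


def tokenize(text: str) -> set:
    """Lowercase and split on whitespace after stripping basic punctuation."""
    for ch in "?.,!":
        text = text.replace(ch, " ")
    return set(text.lower().split())


def retrieve_facts(query: str, corpus: List[str], k: int = 2) -> List[str]:
    """Toy retriever: rank facts by token overlap with the query.

    Stand-in for the fine-tuned ColBERTv2 mentioned in the abstract.
    """
    q = tokenize(query)
    # sorted() is stable, so ties keep their corpus order.
    ranked = sorted(corpus, key=lambda fact: len(q & tokenize(fact)), reverse=True)
    return ranked[:k]


def build_explanation_prompt(question: str, objects: List[str], facts: List[str]) -> str:
    """Object-enriched prompt for the explanation LLM (wording is hypothetical)."""
    return (
        f"Objects in the image: {', '.join(objects)}\n"
        f"Relevant facts: {' '.join(facts)}\n"
        f"Question: {question}\n"
        "Explain briefly what commonsense knowledge is needed to answer."
    )


def build_svlm_text_input(question: str, facts: List[str], explanation: str) -> str:
    """Join question, facts, and explanation into one text input for the sVLM.

    The ' [SEP] ' separator is an assumption, not taken from the paper.
    """
    return " [SEP] ".join([question, *facts, explanation])


def nlki_text_input(question: str, objects: List[str], corpus: List[str],
                    llm: Callable[[str], str], k: int = 2) -> str:
    """Run the three stages and return the text that accompanies the image."""
    facts = retrieve_facts(question, corpus, k)
    explanation = llm(build_explanation_prompt(question, objects, facts))
    return build_svlm_text_input(question, facts, explanation)


if __name__ == "__main__":
    corpus = [
        "Umbrellas are used to keep dry in the rain.",
        "Cats are mammals.",
        "Rain makes streets wet.",
    ]
    stub_llm = lambda prompt: "People hold umbrellas to stay dry."  # placeholder LLM
    print(nlki_text_input(
        "Why is the person holding an umbrella in the rain?",
        ["person", "umbrella"], corpus, stub_llm,
    ))
```

In a real setup the returned string would be passed, together with the image, to a model such as ViLT, VisualBERT, or FLAVA; that step is omitted here because it depends on the specific model's input format.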
