NLKI: A lightweight Natural Language Knowledge Integration Framework for Improving Small VLMs in Commonsense VQA Tasks

Aritra Dutta; Swapnanil Mukherjee; Deepanway Ghosal; Somak Aditya

arXiv:2508.19724·cs.CL·August 29, 2025

TL;DR

This paper introduces NLKI, a framework that enhances small vision-language models in commonsense VQA tasks by integrating retrieved facts and explanations from large language models, significantly improving accuracy and reducing hallucinations.

Contribution

NLKI is the first end-to-end framework combining knowledge retrieval and LLM explanations to boost small VLMs in commonsense reasoning tasks.

Findings

1. NLKI improves answer accuracy by up to 7% across three datasets.

2. Noise-robust training adds a further 2.5-5.5% accuracy.

3. Knowledge integration can outperform knowledge base retrieval in certain cases.

Abstract

Commonsense visual-question answering often hinges on knowledge that is missing from the image or the question. Small vision-language models (sVLMs) such as ViLT, VisualBERT and FLAVA therefore lag behind their larger generative counterparts. To study the effect of careful commonsense knowledge integration on sVLMs, we present an end-to-end framework (NLKI) that (i) retrieves natural language facts, (ii) prompts an LLM to craft natural language explanations, and (iii) feeds both signals to sVLMs, evaluated across two commonsense VQA datasets (CRIC, AOKVQA) and a visual-entailment dataset (e-SNLI-VE). Facts retrieved using a fine-tuned ColBERTv2 and an object information-enriched prompt yield explanations that largely cut down hallucinations, while lifting the end-to-end answer accuracy by up to 7% (across 3 datasets), making FLAVA and other models in NLKI match or exceed medium-sized…
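The three steps in the abstract (retrieve facts, prompt an LLM for an explanation, feed both to the sVLM) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and not the authors' implementation: the token-overlap retriever is a toy stand-in for the fine-tuned ColBERTv2 retriever, `FACTS` is a made-up corpus, `stub_llm` replaces a real LLM call, and the prompt wording and the ` [SEP] ` joining format are hypothetical.

```python
from typing import Callable, List

# Hypothetical toy fact corpus; the paper's actual corpus is not described here.
FACTS = [
    "an umbrella keeps a person dry in the rain",
    "a dog is often kept as a pet",
    "a stove is used for cooking food",
]


def tokenize(text: str) -> set:
    """Lowercase the text, strip basic punctuation, and return its set of tokens."""
    for ch in "?.,!":
        text = text.replace(ch, " ")
    return set(text.lower().split())


def retrieve_facts(question: str, objects: List[str], corpus: List[str], k: int = 2) -> List[str]:
    """Step (i): rank facts by token overlap with the question and detected objects.

    This is a toy stand-in, NOT ColBERTv2; it only shows where retrieval sits.
    """
    query = tokenize(question) | {o.lower() for o in objects}
    ranked = sorted(corpus, key=lambda fact: len(query & tokenize(fact)), reverse=True)
    return ranked[:k]


def build_explanation_prompt(question: str, objects: List[str], facts: List[str]) -> str:
    """Step (ii): build an object-enriched prompt (hypothetical wording) for an LLM."""
    lines = ["Objects in the image: " + ", ".join(objects), "Relevant facts:"]
    lines += ["- " + fact for fact in facts]
    lines += ["Question: " + question,
              "Explain briefly which commonsense knowledge is needed to answer."]
    return "\n".join(lines)


def build_svlm_text(question: str, facts: List[str], explanation: str) -> str:
    """Step (iii): join question, facts, and explanation into one text input (assumed format)."""
    return " [SEP] ".join([question, *facts, explanation])


def stub_llm(prompt: str) -> str:
    """Placeholder for a real LLM call; returns a fixed explanation."""
    return "An umbrella suggests the person wants to stay dry."


def run_pipeline(question: str, objects: List[str], llm: Callable[[str], str]) -> str:
    """Chain the three steps and return the text that would accompany the image."""
    facts = retrieve_facts(question, objects, FACTS, k=1)
    explanation = llm(build_explanation_prompt(question, objects, facts))
    return build_svlm_text(question, facts, explanation)


question = "Why is the person holding an umbrella?"
objects = ["person", "umbrella"]  # would come from an object detector in practice
facts = retrieve_facts(question, objects, FACTS, k=1)
prompt = build_explanation_prompt(question, objects, facts)
explanation = stub_llm(prompt)
svlm_text = build_svlm_text(question, facts, explanation)
```

In a real setup, `svlm_text` would be passed together with the image to a model such as ViLT or FLAVA, and `stub_llm` would be replaced by an actual LLM client.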

Code & Models

Models

dutta18/Colbert-Finetuned