Dynamic Knowledge Integration for Enhanced Vision-Language Reasoning
Julian Perry, Surasakdi Siripong, Thanakorn Phonchai

TL;DR
This paper introduces AKGP-LVLM, a novel method that dynamically integrates external knowledge into large vision-language models, significantly improving their performance on knowledge-intensive multimodal tasks.
Contribution
The paper presents a new adaptive pretraining approach that incorporates structured and unstructured knowledge into LVLMs during pretraining and fine-tuning.
Findings
The method achieves significant performance gains on four benchmark datasets.
Human evaluations show improved correctness and relevance.
The method also demonstrates robustness, efficiency, and scalability.
Abstract
Large Vision-Language Models (LVLMs) have demonstrated impressive capabilities in multimodal tasks, but their performance is often constrained by the lack of external knowledge integration, limiting their ability to handle knowledge-intensive tasks such as visual question answering and reasoning. To address this challenge, we propose a novel method, Adaptive Knowledge-Guided Pretraining for Large Vision-Language Models (AKGP-LVLM), which dynamically incorporates structured and unstructured knowledge into LVLMs during pretraining and fine-tuning. Our approach employs a knowledge encoder to represent external knowledge, a retrieval mechanism to select task-relevant information, and a dynamic adaptor to align multimodal and knowledge representations effectively. We evaluate our method on four benchmark datasets, demonstrating significant performance improvements over state-of-the-art…
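The abstract names three components: a knowledge encoder, a retrieval mechanism, and a dynamic adaptor. The sketch below is one possible reading of how the last two could fit together, not the authors' implementation. It assumes knowledge entries are already encoded as small vectors, uses cosine-similarity top-k as the retrieval step, and uses a scalar sigmoid gate as a stand-in for the adaptor. Every name here (`retrieve`, `mean_pool`, `gated_fuse`, the toy knowledge entries, the gate weights) is hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, knowledge, k=2):
    """Return the k (name, vector) entries most similar to the query."""
    ranked = sorted(knowledge, key=lambda item: cosine(query, item[1]), reverse=True)
    return ranked[:k]

def mean_pool(vectors):
    """Average a list of vectors element-wise."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def gated_fuse(multimodal, knowledge_vec, gate_weights, gate_bias=0.0):
    """Blend two vectors with a scalar sigmoid gate (assumed adaptor form)."""
    # The gate sees the concatenation of both representations.
    z = gate_bias + sum(w * x for w, x in zip(gate_weights, multimodal + knowledge_vec))
    g = 1.0 / (1.0 + math.exp(-z))
    return [g * m + (1.0 - g) * kv for m, kv in zip(multimodal, knowledge_vec)]

# Toy pre-encoded knowledge base (hypothetical values).
knowledge = [
    ("eiffel_tower", [0.9, 0.1, 0.0]),
    ("golden_gate", [0.1, 0.9, 0.0]),
    ("mount_fuji", [0.0, 0.2, 0.9]),
]
query = [1.0, 0.2, 0.0]  # stands in for a fused image-text representation

top = retrieve(query, knowledge, k=2)
pooled = mean_pool([vec for _, vec in top])
# Zero gate weights give g = 0.5, i.e. an even blend of the two inputs.
fused = gated_fuse(query, pooled, gate_weights=[0.0] * 6)
print([name for name, _ in top], fused)
```

In a trained system the gate weights would be learned, so the model could lean on retrieved knowledge for knowledge-intensive inputs and on the multimodal representation otherwise. The zero weights here only make the blend easy to check by hand.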
Taxonomy
Topics: Constraint Satisfaction and Optimization
Methods: ALIGN
