Augmented Vision-Language Models: A Systematic Review

Anthony C Davis, Burhan Sadiq, Tianmin Shu, and Chien-Ming Huang

arXiv:2507.22933 · cs.CL · August 1, 2025

TL;DR

This paper systematically reviews how integrating external symbolic systems with vision-language models can improve interpretability, reasoning, and adaptability, addressing limitations of current large-scale training methods.

Contribution

The review categorizes techniques for enhancing visual-language understanding through neural-symbolic integration, highlighting the potential benefits and current approaches.

Findings

01. Neural-symbolic systems improve interpretability of vision-language models.

02. External symbolic systems enable models to incorporate new information without retraining.

03. The review identifies key categories and methods for integrating symbolic reasoning with visual-language models.
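The first two findings can be illustrated with a toy sketch. It is not taken from the paper: `detect_objects` is a hypothetical stand-in for a pre-trained VLM, and `KnowledgeBase`, `answer`, the image id, and the example facts are all assumptions made up for illustration. The point is only that a fact added to an external symbolic store changes the answer without touching the neural component, and that each answer carries a readable trace.

```python
from __future__ import annotations

from dataclasses import dataclass, field


def detect_objects(image_id: str) -> list[str]:
    """Hypothetical stand-in for a pre-trained VLM: image id -> object labels."""
    fake_detections = {"img_001": ["kettle", "mug"]}  # invented example data
    return fake_detections.get(image_id, [])


@dataclass
class KnowledgeBase:
    """External symbolic store of (subject, relation, object) triples."""

    triples: set[tuple[str, str, str]] = field(default_factory=set)

    def add(self, subject: str, relation: str, obj: str) -> None:
        # New information is a plain insert; no model weights are updated.
        self.triples.add((subject, relation, obj))

    def query(self, subject: str, relation: str) -> list[str]:
        return sorted(o for s, r, o in self.triples if s == subject and r == relation)


def answer(image_id: str, relation: str, kb: KnowledgeBase) -> tuple[list[str], list[str]]:
    """Combine neural detections with symbolic lookups; return (answers, trace)."""
    answers: list[str] = []
    trace: list[str] = []
    for label in detect_objects(image_id):
        for obj in kb.query(label, relation):
            answers.append(obj)
            # The trace records which detection and which fact produced the answer.
            trace.append(f"detected '{label}'; fact ({label}, {relation}, {obj})")
    return answers, trace


kb = KnowledgeBase()
kb.add("kettle", "used_for", "boiling water")
before, _ = answer("img_001", "used_for", kb)

# Add a fact after the fact: the "model" (detect_objects) is left unchanged.
kb.add("mug", "used_for", "drinking")
after, trace = answer("img_001", "used_for", kb)
```

Here `before` contains only the kettle's use, while `after` also includes the mug's, and `trace` lists the detection and fact behind each answer. A real system would replace the lookup table with an actual VLM and a richer reasoning engine.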

Abstract

Recent advances in visual-language machine learning models have demonstrated exceptional ability to use natural language and understand visual scenes by training on large, unstructured datasets. However, this training paradigm cannot produce interpretable explanations for its outputs, requires retraining to integrate new information, is highly resource-intensive, and struggles with certain forms of logical reasoning. One promising solution involves integrating neural networks with external symbolic information systems, forming neural symbolic systems that can enhance reasoning and memory abilities. These neural symbolic systems provide more interpretable explanations for their outputs and the capacity to assimilate new information without extensive retraining. Utilizing powerful pre-trained Vision-Language Models (VLMs) as the core neural component, augmented by external systems, offers…

Taxonomy

Topics: 3D Surveying and Cultural Heritage · Multimodal Machine Learning Applications · Virtual Reality Applications and Impacts