Smart Vision-Language Reasoners
Denisa Roberts, Lucas Roberts

TL;DR
This paper examines how well vision-language models reason along the eight axes of the SMART task. It introduces a novel multimodal layer (the QF layer) and reports accuracy gains of up to 48% on SMART.
Contribution
The paper introduces a new multimodal layer and training strategies that significantly enhance vision-language models' reasoning abilities across eight fundamental axes.
Findings
Up to 48% accuracy gain on the SMART task.
The novel QF multimodal layer improves reasoning performance.
Composite representations with cross-attention enhance visual grounding.
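As a rough illustration of the last finding, the sketch below shows single-head scaled dot-product cross-attention in which text tokens act as queries over image patches, followed by a concatenation that forms a "composite" vector per token. It is not the paper's QF layer. The function names, the identity projections, and the choice of concatenation are all assumptions made for illustration, and the toy vectors are invented.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(text_tokens, image_patches):
    """Single-head scaled dot-product cross-attention (hypothetical helper).

    Queries come from text tokens; keys and values come from image patches.
    Learned projection matrices are omitted (identity) to keep the sketch short.
    """
    d = len(image_patches[0])
    scale = 1.0 / math.sqrt(d)
    attended, weights = [], []
    for q in text_tokens:
        # Similarity of this text token to every image patch.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in image_patches]
        w = softmax(scores)
        # Weighted average of patch vectors: the image context for this token.
        ctx = [sum(wj * v[i] for wj, v in zip(w, image_patches)) for i in range(d)]
        attended.append(ctx)
        weights.append(w)
    return attended, weights

def composite_representation(text_tokens, image_patches):
    """Concatenate each text token with its image-grounded context (assumed fusion)."""
    attended, _ = cross_attention(text_tokens, image_patches)
    return [t + c for t, c in zip(text_tokens, attended)]

# Toy inputs: 2 text tokens and 3 image patches, all of dimension 2.
text = [[1.0, 0.0], [0.0, 1.0]]
patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused = composite_representation(text, patches)
```

Each entry of `fused` has twice the input dimension: the original text token followed by the attention-weighted mix of image patches it attends to. A practical layer would add learned query, key, and value projections and multiple heads.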
Abstract
In this article, we investigate vision-language models (VLMs) as reasoners. The ability to form abstractions underlies mathematical reasoning, problem-solving, and other Math AI tasks. Several formalisms have been proposed for these underlying abstractions and for the skills that humans and intelligent systems use to reason. Because human reasoning is inherently multimodal, we focus our investigation on multimodal AI. We adopt the abstractions of the SMART task (Simple Multimodal Algorithmic Reasoning Task), introduced by Cherian et al. (2022), as meta-reasoning and problem-solving skills along eight axes: math, algebra, counting, path, measure, logic, spatial, and pattern. We investigate the ability of vision-language models to reason along these axes and seek avenues of improvement. Including composite representations with vision-language cross-attention…
Taxonomy
Topics: Semantic Web and Ontologies
Methods: Focus
