Smart Vision-Language Reasoners

Denisa Roberts; Lucas Roberts

arXiv:2407.04212·cs.AI·July 8, 2024

TL;DR

This paper explores the reasoning capabilities of vision-language models along the axes of the SMART task, introducing a novel QF multimodal layer and reporting accuracy gains of up to 48% on multimodal reasoning.

Contribution

The paper introduces a new multimodal layer and training strategies that significantly enhance vision-language models' reasoning abilities across eight fundamental axes.

Findings

01

Up to 48% accuracy gain on the SMART task.

02

The novel QF multimodal layer improves reasoning performance.

03

Composite representations with cross-attention enhance visual grounding.

Abstract

In this article, we investigate vision-language models (VLMs) as reasoners. The ability to form abstractions underlies mathematical reasoning, problem-solving, and other Math AI tasks. Several formalisms have been proposed for the underlying abstractions and skills that humans and intelligent systems use for reasoning. Because human reasoning is inherently multimodal, we focus our investigations on multimodal AI. We employ the abstractions given in the SMART task (Simple Multimodal Algorithmic Reasoning Task) introduced by Cherian et al. (2022) as meta-reasoning and problem-solving skills along eight axes: math, counting, path, measure, logic, algebra, spatial, and pattern. We investigate the ability of vision-language models to reason along these axes and seek avenues of improvement. Including composite representations with vision-language cross-attention…
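
For readers unfamiliar with the term, the vision-language cross-attention mentioned in the abstract can be illustrated with a generic toy sketch. This is not the paper's QF layer or its architecture: the function names, the toy vectors, and the omission of learned projection matrices are all assumptions made for illustration. Text token vectors act as queries, and image patch vectors supply the keys and values.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(text_queries, image_keys, image_values):
    """Scaled dot-product cross-attention (hypothetical helper).

    Each text query attends over all image patches. Learned query/key/value
    projections are omitted here (an assumption made to keep the sketch short).
    """
    scale = math.sqrt(len(image_keys[0]))
    value_dim = len(image_values[0])
    outputs, weights = [], []
    for q in text_queries:
        # Similarity of this text token to every image patch.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in image_keys]
        w = softmax(scores)
        # Weighted average of the patch values.
        out = [sum(wj * v[c] for wj, v in zip(w, image_values)) for c in range(value_dim)]
        outputs.append(out)
        weights.append(w)
    return outputs, weights

# Toy data: 2 text tokens, 3 image patches (values are illustrative only).
text_queries = [[1.0, 0.0], [0.0, 1.0]]
image_keys = [[4.0, 0.0], [0.0, 4.0], [0.0, 0.0]]
image_values = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

fused, attn = cross_attention(text_queries, image_keys, image_values)
```

Each row of `attn` is a probability distribution over image patches, so `fused` holds one image-conditioned vector per text token. A composite representation would then combine such vectors with the unimodal features, for example by concatenation.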

Code & Models

Repositories

smarter-vlm/smarter
PyTorch · Official

Taxonomy

Topics: Semantic Web and Ontologies

Methods: Focus