Are VLMs Really Blind

Ayush Singh; Mansi Gupta; Shivank Garg

arXiv:2410.22029·cs.CL·October 30, 2024

TL;DR

This paper investigates whether vision-language models are truly incapable of geometric reasoning and introduces a pipeline that enhances their ability to answer questions about images through captioning and language model reasoning.

Contribution

The authors propose a novel automatic pipeline that improves VLMs' geometric reasoning by generating question-related captions and leveraging language models without additional fine-tuning.

Findings

01. Enhanced geometric reasoning in VLMs through caption-based approach

02. Improved accuracy in answering geometry-related questions

03. Demonstrated potential for VLMs to handle low-level visual tasks

Abstract

Vision Language Models excel in handling a wide range of complex tasks, including Optical Character Recognition (OCR), Visual Question Answering (VQA), and advanced geometric reasoning. However, these models fail to perform well on low-level basic visual tasks which are especially easy for humans. Our goal in this work was to determine if these models are truly "blind" to geometric reasoning or if there are ways to enhance their capabilities in this area. Our work presents a novel automatic pipeline designed to extract key information from images in response to specific questions. Instead of just relying on direct VQA, we use question-derived keywords to create a caption that highlights important details in the image related to the question. This caption is then used by a language model to provide a precise answer to the question without requiring external fine-tuning.
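
The pipeline described in the abstract (question-derived keywords, then a keyword-focused caption, then a language model answering from that caption) could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the stopword list, the prompt wording, and the names `extract_keywords`, `build_caption_prompt`, and `answer_question` are all hypothetical, and the captioner and language model are passed in as plain callables so that no specific model API is assumed.

```python
import re
from typing import Any, Callable, List

# Small illustrative stopword list (an assumption, not taken from the paper).
STOPWORDS = {
    "a", "an", "the", "is", "are", "in", "of", "how", "many", "do", "does",
    "what", "which", "there", "this", "that", "to", "and", "or", "on", "at",
    "image",
}


def extract_keywords(question: str) -> List[str]:
    """Return the question's content words in order, without duplicates."""
    tokens = re.findall(r"[a-z]+", question.lower())
    keywords: List[str] = []
    for token in tokens:
        if token not in STOPWORDS and token not in keywords:
            keywords.append(token)
    return keywords


def build_caption_prompt(keywords: List[str]) -> str:
    """Build a captioning instruction that focuses on the given keywords."""
    return "Describe the image, focusing on: " + ", ".join(keywords) + "."


def answer_question(
    image: Any,
    question: str,
    captioner: Callable[[Any, str], str],  # hypothetical VLM wrapper
    llm: Callable[[str], str],             # hypothetical language-model wrapper
) -> str:
    """Caption the image with respect to the question, then answer from the caption."""
    keywords = extract_keywords(question)
    caption = captioner(image, build_caption_prompt(keywords))
    prompt = f"Caption: {caption}\nQuestion: {question}\nAnswer:"
    return llm(prompt)


if __name__ == "__main__":
    # Stub models stand in for a real VLM and LLM; neither is fine-tuned here.
    def stub_captioner(image: Any, prompt: str) -> str:
        return "Two circles that overlap at two points."

    def stub_llm(prompt: str) -> str:
        return "2" if "overlap at two points" in prompt else "unknown"

    question = "How many times do the two circles intersect?"
    print(extract_keywords(question))
    print(answer_question(None, question, stub_captioner, stub_llm))
```

Passing the models in as callables keeps the sketch independent of any particular VLM or LLM; in practice each callable would wrap whichever pretrained model is used, with no fine-tuning step in between.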

Code & Models

Repositories

vlgiitr/Are-VLMs-Really-Blind (PyTorch, official)
