Are VLMs Really Blind
Ayush Singh, Mansi Gupta, Shivank Garg

TL;DR
This paper investigates whether vision-language models are truly incapable of geometric reasoning and introduces a pipeline that enhances their ability to answer questions about images through captioning and language model reasoning.
Contribution
The authors propose a novel automatic pipeline that improves VLMs' geometric reasoning by generating question-related captions and leveraging language models without additional fine-tuning.
Findings
Enhanced geometric reasoning in VLMs through a caption-based approach
Improved accuracy in answering geometry-related questions
Demonstrated potential for VLMs to handle low-level visual tasks
Abstract
Vision Language Models excel in handling a wide range of complex tasks, including Optical Character Recognition (OCR), Visual Question Answering (VQA), and advanced geometric reasoning. However, these models fail to perform well on low-level basic visual tasks which are especially easy for humans. Our goal in this work was to determine if these models are truly "blind" to geometric reasoning or if there are ways to enhance their capabilities in this area. Our work presents a novel automatic pipeline designed to extract key information from images in response to specific questions. Instead of just relying on direct VQA, we use question-derived keywords to create a caption that highlights important details in the image related to the question. This caption is then used by a language model to provide a precise answer to the question without requiring external fine-tuning.
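The flow described in the abstract (question-derived keywords, then a keyword-focused caption, then a language model that answers from the caption) can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the stopword-based `extract_keywords`, the prompt wording, and the `captioner` and `reasoner` callables are all hypothetical stand-ins for whichever keyword extractor, VLM, and language model are actually used.

```python
import re
from typing import Any, Callable, List

# Assumption: a tiny hand-written stopword list stands in for a real keyword extractor.
STOPWORDS = {
    "a", "an", "the", "is", "are", "do", "does", "how", "many",
    "what", "which", "in", "of", "to", "and", "or", "there", "this", "that",
}


def extract_keywords(question: str) -> List[str]:
    """Return the non-stopword tokens of the question, in order, without duplicates."""
    tokens = re.findall(r"[a-z0-9]+", question.lower())
    keywords: List[str] = []
    for token in tokens:
        if token not in STOPWORDS and token not in keywords:
            keywords.append(token)
    return keywords


def build_caption_prompt(keywords: List[str]) -> str:
    """Build a captioning prompt that steers the VLM toward question-relevant details."""
    return "Describe the image, focusing on: " + ", ".join(keywords) + "."


def answer_question(
    image: Any,
    question: str,
    captioner: Callable[[Any, str], str],  # hypothetical VLM wrapper: (image, prompt) -> caption
    reasoner: Callable[[str], str],        # hypothetical LLM wrapper: prompt -> answer
) -> str:
    """Answer a question about an image via a caption instead of direct VQA."""
    keywords = extract_keywords(question)
    caption = captioner(image, build_caption_prompt(keywords))
    # The language model sees only text: the caption plus the original question.
    reasoning_prompt = f"Caption: {caption}\nQuestion: {question}\nAnswer:"
    return reasoner(reasoning_prompt)
```

Passing the models in as plain callables keeps the sketch independent of any particular VLM or LLM library, and no component is fine-tuned, which matches the abstract's claim that the pipeline works without additional training.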
Taxonomy
Topics: Retinal Diseases and Treatments
