TL;DR
Prism is a modular framework that separates perception and reasoning in vision-language models, enabling detailed assessment and improving efficiency in vision-language tasks by combining perception-focused VLMs with large language models.
Contribution
Prism introduces a novel two-stage approach to disentangle perception and reasoning in VLMs, allowing systematic evaluation and enhancing performance with reduced costs.
Findings
Prism achieves performance comparable to that of larger VLMs on the MMStar benchmark.
Using GPT-3.5 and a 2B VLM, Prism reduces training and operational expenses.
The framework enables detailed analysis of perception and reasoning capabilities.
Abstract
Vision Language Models (VLMs) demonstrate remarkable proficiency in addressing a wide array of visual questions, which requires strong perception and reasoning faculties. Assessing these two competencies independently is crucial for model refinement, despite the inherent difficulty due to the intertwined nature of seeing and reasoning in existing VLMs. To tackle this issue, we present Prism, an innovative framework designed to disentangle the perception and reasoning processes involved in visual question solving. Prism comprises two distinct stages: a perception stage that utilizes a VLM to extract and articulate visual information in textual form, and a reasoning stage that formulates responses based on the extracted visual information using a Large Language Model (LLM). This modular design enables the systematic comparison and assessment of both proprietary and open-source VLM for…
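The two-stage design described in the abstract can be sketched as a small pipeline in which the perception and reasoning components are swappable callables. This is a minimal illustration under stated assumptions, not the authors' implementation: `TwoStagePipeline`, `stub_vlm`, `stub_llm`, the prompt template, and the default instruction are all hypothetical names and values chosen for the example, and the stubs stand in for real VLM and LLM calls.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical interfaces: any callables with these shapes can be plugged in.
Perceiver = Callable[[bytes, str], str]  # (image bytes, instruction) -> text description
Reasoner = Callable[[str], str]          # (text prompt) -> answer


@dataclass
class TwoStagePipeline:
    """Illustrative perception-then-reasoning pipeline (not the paper's code)."""
    perceive: Perceiver
    reason: Reasoner
    instruction: str = "Describe the image in detail."  # assumed default instruction

    def answer(self, image: bytes, question: str) -> str:
        # Stage 1: the VLM turns visual content into text.
        description = self.perceive(image, self.instruction)
        # Stage 2: the LLM sees only text, never the image itself.
        prompt = (
            f"Image description: {description}\n"
            f"Question: {question}\n"
            "Answer:"
        )
        return self.reason(prompt)


# Stand-in stubs so the sketch runs without any model or network access.
def stub_vlm(image: bytes, instruction: str) -> str:
    return "A red traffic light above an empty crossing."


def stub_llm(prompt: str) -> str:
    return "red" if "red traffic light" in prompt else "unknown"


pipeline = TwoStagePipeline(perceive=stub_vlm, reason=stub_llm)
print(pipeline.answer(b"", "What color is the light?"))
```

Because each stage is an independent callable, holding the reasoner fixed while swapping perceivers (or the reverse) attributes any change in accuracy to one stage, which is the kind of separate assessment the abstract describes.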
Peer Reviews
Decision: NeurIPS 2024 poster
1. The authors present good analysis, findings, and insights based on their framework. The insights are valuable. 2. The authors provide a decent amount of experimental results on many VLMs and demonstrate the soundness and effectiveness of their framework through these experiments. 3. Prism can be useful both for evaluation and as a task solver.
1. There are few unique or novel contributions in terms of algorithms and model design.
Taxonomy
Topics: Elevator Systems and Control
Methods: Linear Layer · Cosine Annealing · Multi-Head Attention · Residual Connection · Softmax · Layer Normalization · Byte Pair Encoding · Attention Dropout
