Enhancing Advanced Visual Reasoning Ability of Large Language Models
Zhiyuan Li, Dongnan Liu, Chaoyi Zhang, Heng Wang, Tengfei Xue, Weidong Cai

TL;DR
This paper introduces CVR-LLM, an approach that combines the visual perception of VLMs with the reasoning capabilities of LLMs, achieving state-of-the-art results on complex visual reasoning tasks without additional training.
Contribution
The paper proposes CVR-LLM, which transforms images into detailed, context-aware descriptions through an iterative self-refinement loop, passes those descriptions to an LLM for prediction, and introduces multi-modal in-context learning and Chain-of-Comparison techniques.
Findings
Achieves state-of-the-art performance on complex visual reasoning benchmarks.
Effectively combines visual perception with advanced reasoning without extra training.
Introduces novel multi-modal in-context learning and Chain-of-Comparison methods.
Abstract
Recent advancements in Vision-Language (VL) research have sparked new benchmarks for complex visual reasoning, challenging models' advanced reasoning ability. Traditional Vision-Language Models (VLMs) perform well in visual perception tasks while struggling with complex reasoning scenarios. Conversely, Large Language Models (LLMs) demonstrate robust text reasoning capabilities; however, they lack visual acuity. To bridge this gap, we propose Complex Visual Reasoning Large Language Models (CVR-LLM), capitalizing on VLMs' visual perception proficiency and LLMs' extensive reasoning capability. Unlike recent multimodal large language models (MLLMs) that require a projection layer, our approach transforms images into detailed, context-aware descriptions using an iterative self-refinement loop and leverages LLMs' text knowledge for accurate predictions without extra training. We also…
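The iterative self-refinement loop mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the prompt wording, the stopping rule, and the `max_rounds` parameter are all assumptions made for illustration. The captioner stands in for a VLM and the feedback function stands in for an LLM that judges whether the description is sufficient for the task.

```python
from typing import Callable, Optional

# Hypothetical interfaces (assumptions, not from the paper):
#   caption_fn(image, prompt) -> description text   (stands in for a VLM)
#   feedback_fn(description, task) -> feedback text, or None when satisfied
#                                                  (stands in for an LLM)
CaptionFn = Callable[[str, str], str]
FeedbackFn = Callable[[str, str], Optional[str]]


def refine_description(
    image: str,
    task: str,
    caption_fn: CaptionFn,
    feedback_fn: FeedbackFn,
    max_rounds: int = 3,
) -> str:
    """Repeatedly re-caption the image until the feedback function is satisfied."""
    base_prompt = f"Describe the image for this task: {task}"
    description = caption_fn(image, base_prompt)
    for _ in range(max_rounds):
        feedback = feedback_fn(description, task)
        if feedback is None:
            # The description is judged sufficient; stop refining.
            break
        # Fold the feedback into the next captioning prompt.
        prompt = f"{base_prompt}\nAddress this feedback: {feedback}"
        description = caption_fn(image, prompt)
    return description


# Toy stand-ins so the sketch runs without any model.
def toy_captioner(image: str, prompt: str) -> str:
    if "feedback" in prompt:
        return "A man holding an umbrella indoors."
    return "A man."


def toy_feedback(description: str, task: str) -> Optional[str]:
    if "umbrella" in description:
        return None
    return "Mention what the man is holding."


if __name__ == "__main__":
    print(refine_description("img.png", "Why is the scene odd?",
                             toy_captioner, toy_feedback))
```

In a real setting the two callables would wrap actual model calls; the final description would then be given to the LLM as text, which is how the abstract describes prediction without extra training.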
Taxonomy
Topics: Multimodal Machine Learning Applications
