TL;DR
This paper proposes a decoupled reasoning framework combining visual interpretation models and large language models to improve multi-modal reasoning, especially in complex math problems, offering a cost-effective and flexible alternative to end-to-end LVLMs.
Contribution
The paper introduces a decoupled reasoning paradigm that reuses existing visual interpretation models and text-based LLMs, improving performance on vision-language tasks without extensive end-to-end training.
Findings
Outperforms recent LVLMs on vision-language benchmarks
Achieves significant gains on geometric math problems
Demonstrates cost-efficient multi-modal reasoning
Abstract
Current large vision-language models (LVLMs) typically employ a connector module to link visual features with text embeddings of large language models (LLMs) and use end-to-end training to achieve multi-modal understanding in a unified process. Effective alignment needs high-quality pre-training data and a carefully designed training process. Current LVLMs face challenges when addressing complex vision-language reasoning tasks, with their reasoning capabilities notably lagging behind those of LLMs. This paper proposes a paradigm shift: instead of training end-to-end vision-language reasoning models, we advocate for developing a decoupled reasoning framework based on existing visual interpretation specialists and text-based reasoning LLMs. Our approach leverages (1) a dedicated vision-language model to transform the visual content of images into textual descriptions and (2) an LLM to…
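The two-stage structure described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the class `DecoupledPipeline`, the callables `describe` and `reason`, and the prompt layout are all hypothetical names chosen for illustration, and the stubs stand in for a real vision-language model and a real LLM. The only thing taken from the abstract is the division of labor: one model turns the image into text, and a text-only LLM works from that text.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical interfaces (assumptions, not from the paper):
# a describer maps raw image bytes to a textual description,
# a reasoner maps a text prompt to a text answer.
Describer = Callable[[bytes], str]
Reasoner = Callable[[str], str]


@dataclass
class DecoupledPipeline:
    """Sketch of a decoupled pipeline: describe the image, then reason over text."""

    describe: Describer
    reason: Reasoner

    def build_prompt(self, description: str, question: str) -> str:
        # The prompt layout is an assumption made for this example.
        return (
            "Image description:\n"
            f"{description}\n\n"
            "Question:\n"
            f"{question}\n\n"
            "Answer step by step."
        )

    def answer(self, image: bytes, question: str) -> str:
        # Stage 1: the visual model converts the image into text.
        description = self.describe(image)
        # Stage 2: the text-only LLM sees the description, never the pixels.
        prompt = self.build_prompt(description, question)
        return self.reason(prompt)


def stub_describer(image: bytes) -> str:
    # Placeholder for a real vision-language model call.
    return "A right triangle with legs of length 3 and 4."


def stub_reasoner(prompt: str) -> str:
    # Placeholder for a real LLM call.
    return f"[stub answer based on {len(prompt)} prompt characters]"


pipeline = DecoupledPipeline(describe=stub_describer, reason=stub_reasoner)
print(pipeline.answer(b"", "What is the length of the hypotenuse?"))
```

Because the two stages communicate only through text, either callable can be swapped for a different model without retraining the other, which is the flexibility the summary attributes to the decoupled design.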
