Integrating Visual Interpretation and Linguistic Reasoning for Math Problem Solving

Zixian Guo; Ming Liu; Qilong Wang; Zhilong Ji; Jinfeng Bai; Lei Zhang; Wangmeng Zuo

arXiv:2505.17609 · cs.AI · August 14, 2025

TL;DR

This paper proposes a decoupled reasoning framework combining visual interpretation models and large language models to improve multi-modal reasoning, especially in complex math problems, offering a cost-effective and flexible alternative to end-to-end LVLMs.

Contribution

It introduces a novel decoupled reasoning paradigm that leverages existing visual and linguistic models, enhancing performance on vision-language tasks without extensive end-to-end training.

Findings

01 Outperforms recent LVLMs on benchmarks
02 Achieves significant gains on geometric math problems
03 Demonstrates cost-efficient multi-modal reasoning

Abstract

Current large vision-language models (LVLMs) typically employ a connector module to link visual features with text embeddings of large language models (LLMs) and use end-to-end training to achieve multi-modal understanding in a unified process. Effective alignment needs high-quality pre-training data and a carefully designed training process. Current LVLMs face challenges when addressing complex vision-language reasoning tasks, with their reasoning capabilities notably lagging behind those of LLMs. This paper proposes a paradigm shift: instead of training end-to-end vision-language reasoning models, we advocate for developing a decoupled reasoning framework based on existing visual interpretation specialists and text-based reasoning LLMs. Our approach leverages (1) a dedicated vision-language model to transform the visual content of images into textual descriptions and (2) an LLM to…
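The two stages named in the abstract, a vision-language model that turns the image into text and an LLM that works from that text, can be sketched as a small pipeline. This is a minimal illustration under stated assumptions, not the authors' implementation: `DecoupledSolver`, `Describer`, `Reasoner`, the prompt wording, and the stand-in model functions are all hypothetical and do not come from the paper or the guozix/dvlr repository.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical interfaces (assumptions, not taken from the paper or its repo).
Describer = Callable[[bytes], str]  # image bytes -> textual description
Reasoner = Callable[[str], str]     # text prompt -> answer text


@dataclass
class DecoupledSolver:
    """Chains a visual interpreter and a text-only reasoner."""

    describe: Describer
    reason: Reasoner

    def build_prompt(self, description: str, question: str) -> str:
        # The reasoner only ever sees text: the description plus the question.
        return (
            "Image description:\n" + description.strip() + "\n\n"
            "Question:\n" + question.strip() + "\n\n"
            "Solve step by step."
        )

    def solve(self, image: bytes, question: str) -> str:
        description = self.describe(image)                 # stage 1: image -> text
        prompt = self.build_prompt(description, question)
        return self.reason(prompt)                         # stage 2: text -> answer


# Stand-in models so the sketch runs without any real model weights.
def fake_describer(image: bytes) -> str:
    return "Right triangle ABC with legs AB = 3 and BC = 4."


def fake_reasoner(prompt: str) -> str:
    # A real LLM call would go here; this stub just pattern-matches.
    if "AB = 3" in prompt and "BC = 4" in prompt:
        return "AC = 5"
    return "unknown"


solver = DecoupledSolver(describe=fake_describer, reason=fake_reasoner)
answer = solver.solve(b"", "What is the length of AC?")
```

Because the two stages communicate only through text, either callable can be swapped for a different model without retraining the other, which is the flexibility the TL;DR points to.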

Code & Models

Repositories

guozix/dvlr (PyTorch, official)
