We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning?

Runqi Qiao; Qiuna Tan; Guanting Dong; Minhui Wu; Chong Sun; Xiaoshuai Song; Zhuoma GongQue; Shanglin Lei; Zhe Wei; Miaoxuan Zhang; Runfeng Qiao; Yifan Zhang; Xiao Zong; Yida Xu; Muxi Diao; Zhimin Bao; Chen Li; Honggang Zhang

arXiv:2407.01284·cs.AI·July 2, 2024

TL;DR

This paper introduces WE-MATH, a benchmark for evaluating human-like mathematical reasoning in large multimodal models (LMMs). It focuses on how models acquire and generalize knowledge rather than on end-to-end performance alone.

Contribution

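WE-MATH is the first benchmark designed to explore the problem-solving principles of LMMs beyond end-to-end performance. It categorizes 6.5K visual math problems across 67 hierarchical knowledge concepts and proposes a novel four-dimensional assessment metric.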

Findings

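01. LMM performance is negatively correlated with the number of solving steps a problem requires.

02. Knowledge augmentation alleviates the Insufficient Knowledge (IK) issue in LMMs.

03. GPT-4o advances towards knowledge generalization: its main challenge shifts from Insufficient Knowledge (IK) to Inadequate Generalization (IG).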

Abstract

Visual mathematical reasoning, as a fundamental visual reasoning ability, has received widespread attention from the Large Multimodal Models (LMMs) community. Existing benchmarks, such as MathVista and MathVerse, focus more on result-oriented performance but neglect the underlying principles of knowledge acquisition and generalization. Inspired by human-like mathematical reasoning, we introduce WE-MATH, the first benchmark specifically designed to explore problem-solving principles beyond end-to-end performance. We meticulously collect and categorize 6.5K visual math problems, spanning 67 hierarchical knowledge concepts and five layers of knowledge granularity. We decompose composite problems into sub-problems according to the required knowledge concepts and introduce a novel four-dimensional metric, namely Insufficient Knowledge (IK), Inadequate Generalization (IG), Complete…
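The abstract describes decomposing each composite problem into sub-problems and labeling the outcome, but it does not spell out the decision rule here. The Python sketch below shows one plausible reading, offered as an assumption rather than the authors' definition: a missed composite problem counts as IK when at least one sub-problem is also missed, and as IG when every sub-problem is answered correctly. The function name `classify_outcome` is hypothetical, and because the remaining two dimensions are truncated in the abstract above, they are grouped under a placeholder label `"other"`.

```python
from typing import List


def classify_outcome(sub_correct: List[bool], composite_correct: bool) -> str:
    """Label one composite problem from its sub-problem results.

    The rule is an assumed reading of the abstract, not the paper's
    official definition:
      - "IK": composite answer wrong and at least one sub-problem wrong
      - "IG": composite answer wrong although every sub-problem is right
      - "other": placeholder for the two dimensions truncated in the abstract
    """
    if not sub_correct:
        raise ValueError("expected at least one sub-problem result")

    all_subs_right = all(sub_correct)

    if not composite_correct and not all_subs_right:
        return "IK"  # assumed: the knowledge needed for a sub-step is missing
    if not composite_correct and all_subs_right:
        return "IG"  # assumed: sub-steps are known but not combined correctly
    return "other"  # composite answered correctly; not subdivided here


if __name__ == "__main__":
    # Made-up results for illustration only, not benchmark data.
    print(classify_outcome([True, False], composite_correct=False))
    print(classify_outcome([True, True], composite_correct=False))
```

Keeping the sub-problem results as a plain list of booleans lets the same function handle two-step and three-step problems without change.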

Code & Models

Repositories

we-math/we-math (official)

Datasets

We-Math/We-Math
dataset · 4.0k downloads

Videos

WE-MATH: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning?

Taxonomy

Topics: Intelligent Tutoring Systems and Adaptive Learning

Methods: Softmax · Attention Is All You Need · Focus