RPTS: Tree-Structured Reasoning Process Scoring for Faithful Multimodal Evaluation
Haofeng Wang, Yu Zhang

TL;DR
This paper introduces RPTS, a tree-structured metric for evaluating the reasoning process in multimodal models, and presents RPTS-Eval, a benchmark to assess and analyze their reasoning capabilities.
Contribution
The paper proposes a novel tree-based reasoning process scoring method and constructs a new benchmark for more faithful evaluation of multimodal reasoning in LVLMs.
Findings
LVLMs show limitations in multimodal reasoning accuracy.
RPTS effectively pinpoints reasoning failures and strengths.
The benchmark reveals differences between open-source and commercial LVLMs.
Abstract
Large Vision-Language Models (LVLMs) excel in multimodal reasoning and have shown impressive performance on various multimodal benchmarks. However, most of these benchmarks evaluate models primarily through multiple-choice or short-answer formats, which do not take the reasoning process into account. Although some benchmarks assess the reasoning process, their methods are often overly simplistic and only examine reasoning when answers are incorrect. This approach overlooks scenarios where flawed reasoning leads to correct answers. In addition, these benchmarks do not consider the impact of intermodal relationships on reasoning. To address these issues, we propose the Reasoning Process Tree Score (RPTS), a tree structure-based metric to assess reasoning processes. Specifically, we organize the reasoning steps into a reasoning tree and leverage its hierarchical information to assign…
Taxonomy
Topics: Multimodal Machine Learning Applications · Speech and dialogue systems · Topic Modeling
