DEVAL: A Framework for Evaluating and Improving the Derivation Capability of Large Language Models
Yifan Li, Qin Li, Min Zhang, Min Zhang

TL;DR
This paper introduces DEVAL, a framework to evaluate and enhance the derivation reasoning capabilities of large language models, revealing their limitations and proposing a prompt engineering method to improve their reasoning performance.
Contribution
The paper formally defines Derivation Relation (DR) and Derivation Capability (DC), and presents DEVAL, a systematic evaluation framework along with Derivation Prompting (DP) to improve LLM reasoning.
Findings
LLMs show moderate DR recognition but struggle to apply DR in problem-solving scenarios.
DEVAL effectively evaluates derivation reasoning across multiple tasks.
Derivation Prompting improves DC by an average of 15.2%.
Abstract
Assessing the reasoning ability of Large Language Models (LLMs) over data remains an open and pressing research question. Unlike LLMs, humans can reliably derive the corresponding modifications to an output from certain kinds of changes to the input. This reasoning pattern, which relies on abstract rules that govern relationships between changes of data, has not been comprehensively described or evaluated in LLMs. In this paper, we formally define this reasoning pattern as the Derivation Relation (DR) and introduce the concept of Derivation Capability (DC), i.e., applying DR by making the corresponding modification to the output whenever the input undergoes certain changes. To assess DC, a systematically constructed evaluation framework named DEVAL is proposed and used to evaluate five popular LLMs and one Large Reasoning Model in seven mainstream tasks. The evaluation results show…
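To make the idea of a Derivation Relation concrete, here is a minimal sketch of how such a relation could be checked on a toy task. It is not DEVAL's implementation. The task (summing a list), the input change (appending one element), and every function name below are hypothetical choices made for illustration, and a plain Python function stands in for the model under test.

```python
from typing import Callable, List, Tuple

Solver = Callable[[List[int]], int]


def reference_solver(xs: List[int]) -> int:
    """Hypothetical task: return the sum of the list."""
    return sum(xs)


def flawed_solver(xs: List[int]) -> int:
    """Stand-in for a model that silently ignores everything past three items."""
    return sum(xs[:3])


def change_input(xs: List[int], extra: int) -> List[int]:
    """Assumed input change: append one element."""
    return xs + [extra]


def derive_output(old_output: int, extra: int) -> int:
    """Output modification implied by that input change: add the new element."""
    return old_output + extra


def satisfies_relation(solver: Solver, xs: List[int], extra: int) -> bool:
    """True if the solver's new output matches the derived output."""
    before = solver(xs)
    after = solver(change_input(xs, extra))
    return after == derive_output(before, extra)


def derivation_score(solver: Solver, cases: List[Tuple[List[int], int]]) -> float:
    """Fraction of cases on which the relation holds (an illustrative metric)."""
    passed = sum(satisfies_relation(solver, xs, extra) for xs, extra in cases)
    return passed / len(cases)


cases = [([1, 2], 3), ([1, 2, 3], 4)]
print(derivation_score(reference_solver, cases))
print(derivation_score(flawed_solver, cases))
```

The check never needs the ground-truth answer for the changed input. It only compares the solver's two outputs against the rule linking them, which is what distinguishes a relation-based check from ordinary accuracy testing.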
Taxonomy
Topics: Topic Modeling · Artificial Intelligence in Healthcare and Education · Text Readability and Simplification
