DyVal: Dynamic Evaluation of Large Language Models for Reasoning Tasks

Kaijie Zhu; Jiaao Chen; Jindong Wang; Neil Zhenqiang Gong; Diyi Yang; Xing Xie

arXiv:2309.17167·cs.AI·March 15, 2024

Open Access · 1 Repo · 3 Reviews

TL;DR

DyVal introduces a dynamic, graph-informed evaluation framework for large language models, enabling more accurate assessment of reasoning capabilities across varying complexities and aiding in model fine-tuning.

Contribution

This paper presents DyVal, a novel, flexible protocol for dynamic evaluation of LLMs using graph structures. It addresses the limitations of static benchmarks and concerns about data contamination.

Findings

01. LLMs perform worse on DyVal-generated samples with increasing complexity.

02. Dynamic evaluation reveals more nuanced performance differences among LLMs.

03. DyVal-generated data can be used for fine-tuning to enhance LLM performance.

Abstract

Large language models (LLMs) have achieved remarkable performance in various evaluation benchmarks. However, concerns are raised about potential data contamination in their considerable volume of training corpus. Moreover, the static nature and fixed complexity of current benchmarks may inadequately gauge the advancing capabilities of LLMs. In this paper, we introduce DyVal, a general and flexible protocol for dynamic evaluation of LLMs. Based on our framework, we build graph-informed DyVal by leveraging the structural advantage of directed acyclic graphs to dynamically generate evaluation samples with controllable complexities. DyVal generates challenging evaluation sets on reasoning tasks including mathematics, logical reasoning, and algorithm problems. We evaluate various LLMs ranging from Flan-T5-large to GPT-3.5-Turbo and GPT-4. Experiments show that LLMs perform worse in…
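To make the idea of graph-generated samples with controllable complexity concrete, here is a minimal sketch in Python. It is not the authors' implementation: the function name `generate_sample`, the operator set, the value range, and the wording of the generated description are all assumptions made for illustration. It builds a tree, the simplest kind of directed acyclic graph, and uses `depth` and `width` as the complexity controls.

```python
import operator
import random

# Hypothetical operator set; the operations used in the paper may differ.
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def generate_sample(depth, width=2, seed=None):
    """Build a random arithmetic tree and return (description_lines, answer).

    depth: number of operator layers above the leaf values.
    width: number of children per operator node.
    Both parameters are illustrative complexity controls.
    """
    rng = random.Random(seed)
    values = {}   # node name -> computed value
    lines = []    # natural-language description, children before parents
    counter = [0]

    def new_name():
        counter[0] += 1
        return f"v{counter[0]}"

    def build(level):
        name = new_name()
        if level == 0:
            # Leaf node: a randomly drawn integer.
            values[name] = rng.randint(1, 10)
            lines.append(f"The value of {name} is {values[name]}.")
            return name
        # Internal node: build the children first, then combine them.
        children = [build(level - 1) for _ in range(width)]
        symbol = rng.choice(sorted(OPS))
        result = values[children[0]]
        for child in children[1:]:
            result = OPS[symbol](result, values[child])
        values[name] = result
        lines.append(
            f"{name} is obtained by applying '{symbol}' "
            f"left to right to {', '.join(children)}."
        )
        return name

    root = build(depth)
    lines.append(f"What is the value of {root}?")
    return lines, values[root]


if __name__ == "__main__":
    question, answer = generate_sample(depth=2, width=2, seed=0)
    print("\n".join(question))
    print("Ground truth:", answer)
```

Because every sample is drawn fresh from a seed, the evaluation set need not exist in any training corpus, and raising `depth` or `width` produces harder instances with a known ground-truth answer.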

Peer Reviews

Decision·ICLR 2024 spotlight

Reviewer 01 · Rating 6 (marginally above the acceptance threshold) · Confidence 3

Strengths

S1. Simple yet flexible framework.
S2. Dynamic task generation with controllable complexity.
S3. Extensive evaluation of selected LLMs / prompting strategies for seven simple reasoning tasks.

On S1. The general idea of the proposed benchmarking framework is to generate tasks that can be described by a directed acyclic graph. This includes "compute graphs" (e.g., evaluate a numerical expression or perform logical reasoning) or "data graphs" (e.g., determine connectivity between vertices). The fr

Weaknesses

W1. Certain computational tasks only.
W2. Discussion of related work / results lacking.
W3. Limitations in generated graphs.
W4. Code/data availability unclear.
W5. Limited insight of experimental study.

On W1. By the nature of the benchmark, it focuses on problems that can be expressed as (currently small) compute graphs or data graphs and are somewhat artificial. It only tests a very limited field of LLM functionality.

On W2. There are benchmarks for all of the tasks that are implemented in this

Reviewer 02 · Rating 6 (marginally above the acceptance threshold) · Confidence 4

Strengths

1. The motivation of this paper is clear. As many LLMs tend to memorize static data for evaluation, this paper proposes a dynamic approach to avoid this kind of problem.
2. The idea of generating tasks with different difficulties in a DAG style sounds interesting.
3. The problem is clearly described with sufficient notations and examples.
4. Experiments are conducted in various aspects, including 7 reasoning tasks, 1 human evaluation, on about 8 well-known LLMs. Fine-tuning experiments are al

Weaknesses

1. The title is somewhat misleading. The evaluation tasks in this paper are mostly about reasoning on maths, logic, algorithms, etc. However, the title reflects no information about this point. The abstract could also be clearer if this point were mentioned earlier.
2. For the fine-tuning results in Section 5, I wonder whether fine-tuning these LLMs on the reasoning tasks proposed in this method influences their general abilities and, if so, to what extent.
3. As the

Reviewer 03 · Rating 6 (marginally above the acceptance threshold) · Confidence 5

Strengths

- Extensive experiments are conducted.
- Graph-based notions of complexities can be used as a means to control the compositional complexity of the examples.
- Addresses data contamination and static complexity of the benchmarks.

Weaknesses

- A common challenge associated with this framework is the need to manually specify a problem as a computation graph with valid constraints. This requirement is only understandable if the LLM is intended to acquire specific skills written in these formats.
- Before reading this paper, I believed that generating a large number of mathematical problems of specific types and evaluating LLMs on them was primarily for debugging specific LLM capabilities, such as compositionality, rather than as an evalu

Code & Models

Repositories

microsoft/promptbench
PyTorch · Official

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Advanced Graph Neural Networks

Methods: Attention Is All You Need · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Label Smoothing · Transformer · Residual Connection · Weight Decay