DyVal: Dynamic Evaluation of Large Language Models for Reasoning Tasks
Kaijie Zhu, Jiaao Chen, Jindong Wang, Neil Zhenqiang Gong, Diyi Yang, Xing Xie

TL;DR
DyVal introduces a dynamic, graph-informed evaluation framework for large language models, enabling more accurate assessment of reasoning capabilities across varying complexities and aiding in model fine-tuning.
Contribution
This paper presents DyVal, a novel flexible protocol for dynamic evaluation of LLMs using graph structures, addressing limitations of static benchmarks and data contamination concerns.
Findings
LLMs perform worse on DyVal-generated samples with increasing complexity.
Dynamic evaluation reveals more nuanced performance differences among LLMs.
DyVal-generated data can be used for fine-tuning to enhance LLM performance.
Abstract
Large language models (LLMs) have achieved remarkable performance in various evaluation benchmarks. However, concerns are raised about potential data contamination in their considerable volume of training corpus. Moreover, the static nature and fixed complexity of current benchmarks may inadequately gauge the advancing capabilities of LLMs. In this paper, we introduce DyVal, a general and flexible protocol for dynamic evaluation of LLMs. Based on our framework, we build graph-informed DyVal by leveraging the structural advantage of directed acyclic graphs to dynamically generate evaluation samples with controllable complexities. DyVal generates challenging evaluation sets on reasoning tasks including mathematics, logical reasoning, and algorithm problems. We evaluate various LLMs ranging from Flan-T5-large to GPT-3.5-Turbo and GPT-4. Experiments show that LLMs perform worse in…
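The graph-informed idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function name `generate_sample`, the parameters `num_leaves` and `num_ops`, the operator set, and the wording of the generated question are all assumptions made for illustration. It builds a random arithmetic directed acyclic graph, in which each new node depends only on earlier nodes, and returns a natural-language question together with its ground-truth answer. Raising `num_ops` is a simplified stand-in for the complexity controls the abstract mentions.

```python
import operator
import random

# Hypothetical operator table; the operation set used by DyVal is not stated in this summary.
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def generate_sample(num_leaves, num_ops, seed=None):
    """Generate one arithmetic question from a random DAG.

    Nodes 0..num_leaves-1 are leaves holding random integers.
    Each later node applies a random operator to two distinct earlier
    nodes, so edges always point from lower to higher indices and the
    graph cannot contain a cycle.
    Returns (question_text, ground_truth_answer).
    """
    if num_leaves < 2 or num_ops < 1:
        raise ValueError("need at least two leaves and one operation node")

    rng = random.Random(seed)
    values = []  # values[i] is the computed value of node i
    lines = []   # natural-language description, one sentence per node

    # Leaf nodes: plain integer values.
    for i in range(num_leaves):
        value = rng.randint(1, 9)
        values.append(value)
        lines.append(f"The value of n{i} is {value}.")

    # Operation nodes: each combines two distinct earlier nodes.
    for i in range(num_leaves, num_leaves + num_ops):
        a, b = rng.sample(range(i), 2)
        symbol = rng.choice(sorted(OPS))
        values.append(OPS[symbol](values[a], values[b]))
        lines.append(f"The value of n{i} is n{a} {symbol} n{b}.")

    # Ask for the last node, which was created last and can depend on any earlier node.
    target = num_leaves + num_ops - 1
    question = " ".join(lines) + f" What is the value of n{target}?"
    return question, values[target]


if __name__ == "__main__":
    # More operation nodes give a longer reasoning chain for the same task type.
    for ops in (1, 3, 6):
        question, answer = generate_sample(num_leaves=2, num_ops=ops, seed=ops)
        print(question, "->", answer)
```

Because every sample is drawn fresh from the generator, a fixed test set never has to be published, which is the property that addresses the contamination concern raised in the abstract.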
Peer Reviews
Decision·ICLR 2024 spotlight
S1. Simple, yet flexible framework.
S2. Dynamic task generation with controllable complexity.
S3. Extensive evaluation of selected LLMs / prompting strategies for seven simple reasoning tasks.
On S1. The general idea of the proposed benchmarking framework is to generate tasks that can be described by a directed acyclic graph. This includes "compute graphs" (e.g., evaluate a numerical expression or perform logical reasoning) or "data graphs" (e.g., determine connectivity between vertices). The fr
W1. Certain computational tasks only.
W2. Discussion of related work / results lacking.
W3. Limitations in generated graphs.
W4. Code/data availability unclear.
W5. Limited insight of experimental study.
On W1. By the nature of the benchmark, it focuses on problems that can be expressed as (currently small) compute graphs or data graphs and are somewhat artificial. It only tests a very limited field of LLM functionality. On W2. There are benchmarks for all of the tasks that are implemented in this
1. The motivation of this paper is clear. As many LLMs tend to memorize static data for evaluation, this paper proposes a dynamic approach to avoid this kind of problem.
2. The idea of generating tasks with different difficulties in a DAG style sounds interesting.
3. The problem is clearly described with sufficient notations and examples.
4. Experiments are conducted in various aspects, including 7 reasoning tasks, 1 human evaluation, on about 8 well-known LLMs. Fine-tuning experiments are al
1. The title is somewhat misleading. The evaluation tasks in this paper are mostly about reasoning on maths, logic, algorithms, etc. However, the title reflects no information about this point. The abstract could be also clearer if this point can be mentioned earlier.
2. For the fine-tuning results in Section 5, I wonder when these LLMs are fine-tuned for the reasoning tasks proposed in this method, will the general abilities be influenced? Or to what extent will they be influenced?
3. As the
- Extensive experiments are conducted.
- Graph-based notions of complexities can be used as a means to control the compositional complexity of the examples.
- Address data contamination and static complexity of the benchmarks.
- A common challenge associated with this framework is the need to manually specify a problem as a computation graph with valid constraints. This requirement is only understandable if LLM is intended to acquire specific skills written in these formats.
- Before reading this paper, I believed that generating a large number of mathematical problems of specific types and evaluating LLMs on them was primarily for debugging specific LLM capabilities, such as compositionality, rather than as an evalu
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Advanced Graph Neural Networks
Methods: Attention Is All You Need · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Label Smoothing · Transformer · Residual Connection · Weight Decay
