DARG: Dynamic Evaluation of Large Language Models via Adaptive Reasoning Graph
Zhehao Zhang, Jiaao Chen, Diyi Yang

TL;DR
DARG is a dynamic evaluation framework that adaptively generates reasoning data of controlled complexity for LLMs. It shows that LLM performance drops and biases grow as complexity increases, offering a more robust alternative to static benchmarks.
Contribution
This work presents a novel method to dynamically extend benchmarks by perturbing reasoning graphs, enabling adaptive and controlled evaluation of LLMs across multiple domains.
Findings
LLMs' performance decreases with increased data complexity
Certain LLMs show significant performance drops at higher complexity levels
Higher complexity evaluations reveal increased biases in LLMs
Abstract
The current paradigm of evaluating Large Language Models (LLMs) through static benchmarks comes with significant limitations, such as vulnerability to data contamination and a lack of adaptability to the evolving capabilities of LLMs. Therefore, evaluation methods that can adapt and generate evaluation data with controlled complexity are urgently needed. In this work, we introduce Dynamic Evaluation of LLMs via Adaptive Reasoning Graph Evolvement (DARG) to dynamically extend current benchmarks with controlled complexity and diversity. Specifically, we first extract the reasoning graphs of data points in current benchmarks and then perturb the reasoning graphs to generate novel testing data. Such newly generated test samples can have different levels of complexity while maintaining linguistic diversity similar to the original benchmarks. We further use a code-augmented LLM to ensure the…
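The extract-then-perturb idea described in the abstract can be illustrated with a toy arithmetic reasoning graph. The sketch below is only an illustration under stated assumptions, not the authors' implementation: the `Node` class, the `evaluate`, `depth`, and `deepen` helpers, and the apple word problem are all hypothetical, and the step of turning a perturbed graph back into natural-language text is not shown.

```python
import operator
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

# Hypothetical toy representation: leaves hold numbers, inner nodes apply a binary op.
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


@dataclass
class Node:
    name: str
    value: Optional[int] = None               # set only on leaf nodes
    op: Optional[str] = None                  # set only on operation nodes
    inputs: List[str] = field(default_factory=list)


Graph = Dict[str, Node]


def evaluate(graph: Graph, name: str) -> int:
    """Recompute the ground-truth answer by walking the graph from `name`."""
    node = graph[name]
    if node.op is None:
        return node.value
    left, right = (evaluate(graph, child) for child in node.inputs)
    return OPS[node.op](left, right)


def depth(graph: Graph, name: str) -> int:
    """Number of chained operations on the longest path below `name`."""
    node = graph[name]
    if node.op is None:
        return 0
    return 1 + max(depth(graph, child) for child in node.inputs)


def deepen(graph: Graph, root: str, op: str,
           leaf_name: str, leaf_value: int) -> Tuple[Graph, str]:
    """Perturb the graph by stacking one extra operation on top of `root`."""
    new_graph = dict(graph)  # shallow copy; existing nodes are never mutated
    new_graph[leaf_name] = Node(leaf_name, value=leaf_value)
    new_root = f"{root}_{op}_{leaf_name}"
    new_graph[new_root] = Node(new_root, op=op, inputs=[root, leaf_name])
    return new_graph, new_root


# "Ann has 3 boxes of 4 apples and eats 2. How many are left?"
graph: Graph = {
    "boxes": Node("boxes", value=3),
    "per_box": Node("per_box", value=4),
    "eaten": Node("eaten", value=2),
    "total": Node("total", op="*", inputs=["boxes", "per_box"]),
    "left": Node("left", op="-", inputs=["total", "eaten"]),
}

# One perturbation step: "...then she is given 5 more apples."
harder, new_root = deepen(graph, "left", "+", "gift", 5)
```

Here the original graph has two chained operations and the perturbed one has three, so depth acts as one possible complexity knob. Because the answer is recomputed from the graph itself, the label for the harder sample stays consistent with its structure.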
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
