From Static Benchmarks to Dynamic Protocol: Agent-Centric Text Anomaly Detection for Evaluating LLM Reasoning

Seungdong Yoa; Sanghyu Yoon; Suhee Yoon; Dongmin Kim; Ye Seul Sim; Junhyun Lee; Woohyung Lim

arXiv:2602.23729·cs.CL·March 2, 2026

Open Access · 3 Reviews

TL;DR

This paper introduces a dynamic, agent-centric benchmarking protocol for evaluating large language models, enabling scalable, evolving assessments that better reveal reasoning errors than static datasets.

Contribution

It proposes a novel dynamic evaluation framework involving autonomous agents that generate, validate, and solve problems, moving beyond traditional static benchmarks.

Findings

1. Systematically exposes reasoning errors in LLMs
2. Enables automatic scaling of difficulty in evaluation
3. Highlights limitations of static benchmarks

Abstract

The evaluation of large language models (LLMs) has predominantly relied on static datasets, which offer limited scalability and fail to capture the evolving reasoning capabilities of recent models. To overcome these limitations, we propose an agent-centric benchmarking paradigm that moves beyond static datasets by introducing a dynamic protocol in which autonomous agents iteratively generate, validate, and solve problems. Within this protocol, a teacher agent generates candidate problems, an orchestrator agent rigorously verifies their validity and guards against adversarial attacks, and a student agent attempts to solve the validated problems. An invalid problem is revised by the teacher agent until it passes validation. If the student correctly solves the problem, the orchestrator prompts the teacher to generate more challenging variants. Consequently, the benchmark scales in…
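The teacher, orchestrator, and student loop described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `Problem`, `run_protocol`, `max_rounds`, and `max_revisions` are hypothetical, the three agents are plain callables standing in for LLM calls, and the rule that the loop stops at the first validated problem the student fails is an assumption drawn from the reviewers' mention of failure-driven sample finalization.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Problem:
    text: str
    answer: str
    difficulty: int


def run_protocol(
    generate: Callable[[int, Optional[Problem]], Problem],  # teacher (hypothetical signature)
    validate: Callable[[Problem], bool],                     # orchestrator check
    solve: Callable[[Problem], str],                         # student
    max_rounds: int = 5,       # assumed cap on difficulty escalations
    max_revisions: int = 3,    # assumed cap on teacher revisions per problem
) -> Optional[Problem]:
    """Return the first validated problem the student fails, or None."""
    difficulty = 1
    for _ in range(max_rounds):
        # Teacher proposes a candidate problem at the current difficulty.
        problem = generate(difficulty, None)
        revisions = 0
        # Orchestrator rejects invalid problems; teacher revises them.
        while not validate(problem):
            if revisions == max_revisions:
                return None  # give up if no valid problem emerges
            problem = generate(difficulty, problem)
            revisions += 1
        # Student attempts the validated problem.
        if solve(problem) != problem.answer:
            return problem  # student failed: keep this problem
        # Student succeeded: ask for a harder variant next round.
        difficulty += 1
    return None


# Toy stand-ins for the three agents (purely illustrative).
def toy_teacher(difficulty: int, rejected: Optional[Problem]) -> Problem:
    # The first draft at difficulty 2 is deliberately invalid to exercise revision.
    if difficulty == 2 and rejected is None:
        return Problem("draft", "", difficulty)
    return Problem(f"task at level {difficulty}", str(difficulty), difficulty)


def toy_orchestrator(problem: Problem) -> bool:
    return bool(problem.text) and bool(problem.answer)


def toy_student(problem: Problem) -> str:
    # This toy student only copes with difficulty below 3.
    return problem.answer if problem.difficulty < 3 else "wrong"


final = run_protocol(toy_teacher, toy_orchestrator, toy_student)
```

In this toy run the student solves the first two levels and fails the third, so `final` holds the level-3 problem. In a real setting each callable would wrap a model call, and the validity check would be far richer than a non-empty test.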

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 3

Strengths

1. The paper addresses a timely issue where static benchmarks are becoming saturated and insufficient to capture the evolving reasoning capabilities of modern LLMs.
2. The proposed teacher–orchestrator–student framework is conceptually interesting and provides a scalable, automated approach for generating dynamic and evolving benchmarks.
3. The experiments provide meaningful insights into LLM reasoning across different model families on text anomaly detection and the contributions of individua

Weaknesses

1. Based on the prompts shown in Appendix D, the current framework appears to generate questions primarily from the LLM’s inherent knowledge rather than external sources. It is unclear how well this approach generalizes to reasoning problems that require access to factual data or rigorous mathematical reasoning, given that LLMs’ capabilities in these domains are still evolving and may limit the robustness of generated questions.
2. Since the benchmark questions are generated by LLMs, potential

Reviewer 02 · Rating 2 · Confidence 4

Strengths

* The proposed agentic co-evolution of models and benchmarks represents a natural next step after efforts like BenchAgents and DyVal, making this work both timely and forward-looking.
* The inclusion of Orchestrator-regulated validation and failure-driven sample finalization helps control problem difficulty and fairness.
* The taxonomy of seven anomaly types, each probing distinct reasoning skills (contextual, logical, referential, stylistic, etc.), is well-grounded and illustrative

Weaknesses

* While the empirical results are strong, the notion of “difficulty increase” remains heuristic and behaviorally defined (based on Student failure). A more formal definition would strengthen claims.
* The paper cites prior dynamic evaluation works (DyVal, DARG, Benchmark Self-Evolving), but comparative experiments are missing. It is unclear how ATAD quantitatively improves over these prior agent-based or meta-probing systems beyond conceptual novelty.
* Although text anomaly detection is reaso

Reviewer 03 · Rating 6 · Confidence 3

Strengths

1. The paper tackles a well-motivated and timely problem, clearly articulating the critical failings of static benchmarks, such as data contamination and performance saturation. This strong motivation establishes a clear need for the proposed dynamic evaluation protocol. By addressing these limitations, the work provides a valuable and forward-looking contribution to LLM evaluation.
2. The evaluation is comprehensive, featuring a wide array of models from different families (e.g., GPT, Gemini,

Weaknesses

1. The paper's primary contribution is the novel protocol (ATAD) and its application, rather than a new underlying model architecture or training technique. While this agent-centric framework is a significant engineering and conceptual achievement, the work relies entirely on existing LLMs as its core components. This focus on the evaluation system means there is limited technical novelty in terms of model-level innovations.
2. While the paper demonstrates that the protocol creates more difficu

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification