TestAgent: An Adaptive and Intelligent Expert for Human Assessment

Junhao Yu; Yan Zhuang; YuXuan Sun; Weibo Gao; Qi Liu; Mingyue Cheng; Zhenya Huang; Enhong Chen

arXiv:2506.03032·cs.AI·June 4, 2025

Open Access·4 Reviews

TL;DR

TestAgent introduces an LLM-powered adaptive testing system that personalizes assessments, reduces question count by 20%, and improves accuracy and user experience across various domains.

Contribution

This paper presents the first application of large language models in adaptive testing, enhancing personalization, interaction, and accuracy in human assessment.

Findings

01

Requires 20% fewer questions than baselines.

02

Improves accuracy of assessments.

03

Preferred by users for its speed and smoothness.

Abstract

Accurately assessing internal human states is key to understanding preferences, offering personalized services, and identifying challenges in real-world applications. Originating from psychometrics, adaptive testing has become the mainstream method for human measurement and has now been widely applied in education, healthcare, sports, and sociology. It customizes assessments by selecting the fewest test questions. However, current adaptive testing methods face several challenges. The mechanized nature of most algorithms leads to guessing behavior and difficulties with open-ended questions. Additionally, subjective assessments suffer from noisy response data and coarse-grained test outputs, further limiting their effectiveness. To move closer to an ideal adaptive testing process, we propose TestAgent, a large language model (LLM)-powered agent designed to enhance adaptive testing…
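To make "selecting the fewest test questions" concrete, here is a minimal sketch of a classical adaptive testing loop from psychometrics. It is not TestAgent's method. It assumes a two-parameter logistic (2PL) item response model, maximum-Fisher-information item selection, and a grid-search ability estimate; the function names, the item format `(a, b)` (discrimination, difficulty), and the `answer_fn` callback are all hypothetical and chosen only for illustration.

```python
import math

def p_correct(theta, a, b):
    """2PL model: probability that ability `theta` answers item (a, b) correctly."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def fisher_information(theta, a, b):
    """Information an item (a, b) carries about ability at `theta`."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)

def select_next(theta, items, asked):
    """Pick the unasked item that is most informative at the current estimate."""
    candidates = [i for i in range(len(items)) if i not in asked]
    return max(candidates, key=lambda i: fisher_information(theta, *items[i]))

def estimate_theta(responses, items, grid=None):
    """Maximum-likelihood ability estimate over a coarse grid in [-4, 4]."""
    grid = grid or [x / 10.0 for x in range(-40, 41)]

    def log_likelihood(theta):
        total = 0.0
        for idx, correct in responses:
            p = p_correct(theta, *items[idx])
            total += math.log(p if correct else 1.0 - p)
        return total

    return max(grid, key=log_likelihood)

def run_adaptive_test(items, answer_fn, max_questions=5):
    """Alternate between item selection and ability re-estimation.

    `answer_fn(idx)` is a hypothetical callback returning True when the
    test taker answers item `idx` correctly.
    """
    theta, responses, asked = 0.0, [], set()
    for _ in range(min(max_questions, len(items))):
        idx = select_next(theta, items, asked)
        asked.add(idx)
        responses.append((idx, answer_fn(idx)))
        theta = estimate_theta(responses, items)
    return theta, [idx for idx, _ in responses]

if __name__ == "__main__":
    # Toy item bank: equal discrimination, difficulties -2, 0, and 2.
    bank = [(1.0, -2.0), (1.0, 0.0), (1.0, 2.0)]
    print(run_adaptive_test(bank, lambda idx: idx != 2, max_questions=3))
```

Because each question is chosen where the current estimate is least certain, a loop of this kind can stop after fewer items than a fixed-form test. The fixed `max_questions` cutoff is a simplification; a stopping rule based on the estimate's precision would be an equally valid choice.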

Peer Reviews

Decision·ICLR 2025 Conference Withdrawn Submission

Reviewer 01·Rating 5·Confidence 4

Strengths

The idea here is good and the solution is comprehensive. At a high level, the authors propose using LLMs to perform human assessment instead of existing rigid approaches. This intuitively makes sense, but of course is not straightforward due to the inclusion of a potentially unreliable model. Representing the user's performance based on previous answers, and using this representation to implement anomaly detection, is an interesting concept. In general the anomaly detection component is co

Weaknesses

The Universal Data Infrastructure is not well defined; I came away not really understanding this component. Please revise the description with a focus on clarity, specifically the cognitive diagnosis training. The major weakness here is that using an LLM introduces a new problem, namely LLM hallucinations leading to misrepresentation of results. The authors bring this up in Section 3.8. It would be worthwhile for the authors to discuss the trade-off between the introduction of hallucinations a

Reviewer 02·Rating 3·Confidence 4

Strengths

- The work attempts to tackle an important problem of human measurement. It seems like a good idea to leverage LLM-based agents for this purpose.
- The work tries to study their proposed systems through a series of simulated and real-world experiments.

Weaknesses

- Overall, the writing and organization of the paper are hard to follow and thus make it hard to provide a nuanced assessment of the submission. There are also many typos throughout the work—please proofread carefully. The remaining comments will talk more about suggestions for improving the presentation of methodology as well as recommendations for evaluation clarity.
- What exactly is being measured through human assessments is unclear throughout Section 2, i.e. what is exactly captured by $\t

Reviewer 03·Rating 5·Confidence 4

Strengths

Originality: One of the paper’s most compelling strengths is its originality. By integrating large language models (LLMs) into the adaptive testing process, it introduces a novel application that creatively merges advanced language modeling with educational and psychological assessment. This approach innovates beyond traditional methods by offering a conversational, interactive experience in adaptive testing, thereby removing several limitations of rigid, fixed-question formats. The introduction

Weaknesses

Although the paper demonstrates promising results, the experiments rely heavily on synthetic or simulated data to train and evaluate TestAgent. While this is a practical approach for initial testing, it limits the generalizability and applicability of the results to real-world scenarios, where user responses may be less predictable and more varied. Incorporating real-world data from actual assessments, such as real educational tests or personality assessments, could better validate the model’s e

Reviewer 04·Rating 6·Confidence 2

Strengths

One of the most remarkable aspects of this paper is its introduction of a new task, RLPA, which fills a gap in the field of Knowledge Tracing. Most traditional models assume that training and test data distributions remain the same, but CUFF-KT challenges that assumption with a much more flexible approach. This isn’t just a refinement of existing models—it’s a fresh way of addressing a real-world problem. Specifically, it handles learners’ constantly evolving learning patterns without needing to

Weaknesses

While the paper focuses on CUFF-KT’s use in Knowledge Tracing, it would be interesting to see if this approach could be applied to other areas. For example, could CUFF-KT’s adaptive framework be useful in adaptive testing systems or personalized recommendation systems? Could this model work just as well in those scenarios, or would modifications be needed? Exploring this could broaden the paper’s overall impact. The datasets used in the experiments are relevant, but the variety is somewhat limi

Taxonomy

Topics·Risk and Safety Analysis