TestAgent: An Adaptive and Intelligent Expert for Human Assessment
Junhao Yu, Yan Zhuang, YuXuan Sun, Weibo Gao, Qi Liu, Mingyue Cheng, Zhenya Huang, Enhong Chen

TL;DR
TestAgent introduces an LLM-powered adaptive testing system that personalizes assessments, reduces question count by 20%, and improves accuracy and user experience across various domains.
Contribution
This paper presents the first application of large language models in adaptive testing, enhancing personalization, interaction, and accuracy in human assessment.
Findings
Achieves 20% fewer questions than baselines.
Improves accuracy of assessments.
Receives user preference for speed and smoothness.
Abstract
Accurately assessing internal human states is key to understanding preferences, offering personalized services, and identifying challenges in real-world applications. Originating from psychometrics, adaptive testing has become the mainstream method for human measurement and is now widely applied in education, healthcare, sports, and sociology. It customizes assessments by selecting the fewest test questions. However, current adaptive testing methods face several challenges. The mechanized nature of most algorithms leads to guessing behavior and difficulties with open-ended questions. Additionally, subjective assessments suffer from noisy response data and coarse-grained test outputs, further limiting their effectiveness. To move closer to an ideal adaptive testing process, we propose TestAgent, a large language model (LLM)-powered agent designed to enhance adaptive testing…
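To make the idea of "selecting the fewest test questions" concrete, the following is a minimal sketch of a classic, non-LLM adaptive testing loop. It is not TestAgent's method: it assumes a two-parameter logistic (2PL) item response model and maximum Fisher information selection, a common textbook baseline. The item bank, the function names, and the simulated respondent are all hypothetical.

```python
import math

# Hypothetical item bank: (discrimination a, difficulty b) per question.
ITEMS = [(1.0, -2.0), (1.0, -1.0), (1.0, 0.0), (1.0, 1.0), (1.0, 2.0)]

def p_correct(theta, a, b):
    """2PL model: probability that ability theta answers the item correctly."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def fisher_information(theta, a, b):
    """Information the item carries about theta under the 2PL model."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)

def select_next_item(theta, items, asked):
    """Pick the unasked item that is most informative at the current estimate."""
    candidates = [i for i in range(len(items)) if i not in asked]
    return max(candidates, key=lambda i: fisher_information(theta, *items[i]))

def update_theta(theta, responses, items, lr=0.5, steps=50):
    """Gradient ascent on the 2PL log-likelihood, clamped to [-4, 4]."""
    for _ in range(steps):
        grad = sum(items[i][0] * (y - p_correct(theta, *items[i]))
                   for i, y in responses)
        theta = max(-4.0, min(4.0, theta + lr * grad))
    return theta

def run_adaptive_test(answer_fn, items, max_items=3):
    """Ask up to max_items questions, re-estimating ability after each answer."""
    theta, asked, responses = 0.0, set(), []
    for _ in range(min(max_items, len(items))):
        i = select_next_item(theta, items, asked)
        asked.add(i)
        responses.append((i, answer_fn(i)))  # 1 = correct, 0 = incorrect
        theta = update_theta(theta, responses, items)
    return theta, responses

if __name__ == "__main__":
    # Simulated respondent (an assumption): correct on items easier than b = 1.
    respondent = lambda i: 1 if ITEMS[i][1] < 1.0 else 0
    estimate, log = run_adaptive_test(respondent, ITEMS)
    print("estimated ability:", round(estimate, 2), "items asked:", log)
```

Each question is chosen where the current ability estimate is most uncertain, which is why an adaptive test can stop after fewer questions than a fixed-form test. The limitations listed in the abstract (guessing behavior, open-ended questions, noisy responses) are ones a purely statistical loop like this does not address.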
Peer Reviews
Decision·ICLR 2025 Conference Withdrawn Submission
The idea here is good and the solution is comprehensive. At a high level the authors are proposing to use LLMs to perform human assessment vs. using existing rigid approaches. This intuitively makes sense, but of course is not straightforward due to the inclusion of a potentially unreliable model. Representing the user's performance based on previous answers, and using this representation to implement anomaly detection, is an interesting concept. In general the anomaly detection component is co
The Universal Data Infrastructure is not well defined; I came away not really understanding this component. Please revise the description with a focus on clarity, specifically cognitive diagnosis training. The major weakness here is that using an LLM introduces a new problem, namely LLM hallucinations leading to misrepresentation of results. The authors bring this up in section 3.8. It would be worthwhile for the authors to discuss the trade-off between the introduction of hallucinations a
- The work attempts to tackle an important problem of human measurement. It seems like a good idea to leverage LLM-based agents for this purpose. - The work tries to study their proposed systems through a series of simulated and real-world experiments.
- Overall, the writing and organization of the paper are hard to follow and thus make it hard to provide a nuanced assessment of the submission. There are also many typos throughout the work—please proofread carefully. The remaining comments will talk more about suggestions for improving the presentation of methodology as well as recommendations for evaluation clarity. - What exactly is being measured through human assessments is unclear throughout Section 2, i.e. what is exactly captured by $\t
Originality: One of the paper’s most compelling strengths is its originality. By integrating large language models (LLMs) into the adaptive testing process, it introduces a novel application that creatively merges advanced language modeling with educational and psychological assessment. This approach innovates beyond traditional methods by offering a conversational, interactive experience in adaptive testing, thereby removing several limitations of rigid, fixed-question formats. The introduction
Although the paper demonstrates promising results, the experiments rely heavily on synthetic or simulated data to train and evaluate TestAgent. While this is a practical approach for initial testing, it limits the generalizability and applicability of the results to real-world scenarios, where user responses may be less predictable and more varied. Incorporating real-world data from actual assessments, such as real educational tests or personality assessments, could better validate the model’s e
One of the most remarkable aspects of this paper is its introduction of a new task, RLPA, which fills a gap in the field of Knowledge Tracing. Most traditional models assume that training and test data distributions remain the same, but CUFF-KT challenges that assumption with a much more flexible approach. This isn’t just a refinement of existing models—it’s a fresh way of addressing a real-world problem. Specifically, it handles learners’ constantly evolving learning patterns without needing to
While the paper focuses on CUFF-KT’s use in Knowledge Tracing, it would be interesting to see if this approach could be applied to other areas. For example, could CUFF-KT’s adaptive framework be useful in adaptive testing systems or personalized recommendation systems? Could this model work just as well in those scenarios, or would modifications be needed? Exploring this could broaden the paper’s overall impact. The datasets used in the experiments are relevant, but the variety is somewhat limi
Taxonomy
Topics: Risk and Safety Analysis
