Multi-Dimensional Behavioral Evaluation of Agentic Stock Prediction Systems Using Large Language Model Judges with Closed-Loop Reinforcement Learning Feedback

Mohammad Al Ridhawi; Mahtab Haj Ali; Hussein Al Osman

arXiv:2605.05739·cs.LG·May 19, 2026

TL;DR

This paper introduces a behavioral evaluation framework for agentic AI systems in which an ensemble of LLM judges scores the intermediate decision process of a stock prediction agent, and those scores feed back into the agent through closed-loop reinforcement learning.

Contribution

It develops a multi-dimensional behavioral scoring method for autonomous decision processes, covering six domain-specific dimensions, and reports a reduction in one-day MAPE from 0.61% to 0.54% after fine-tuning.

Findings

01

The composite behavioral score correlates with the 20-day Sharpe ratio (Spearman rho = 0.72).

02

Fine-tuning reduces one-day MAPE from 0.61% to 0.54%.

03

Inter-judge agreement is high, with Krippendorff's alpha = 0.85.
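
The three statistics quoted in these findings can be computed as in the minimal sketch below. This is not the authors' code. The function names, the tie-averaged ranking, and the interval-metric form of Krippendorff's alpha for complete data (every judge scores every episode) are assumptions made here for illustration; the paper may use different variants.

```python
from math import sqrt


def mape(actual, predicted):
    """Mean absolute percentage error, returned in percent."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sx * sy)


def krippendorff_alpha_interval(units):
    """Krippendorff's alpha with the interval metric (assumed variant).

    `units` holds one list per episode with one score per judge,
    and no missing scores.
    """
    values = [v for unit in units for v in unit]
    n_total = len(values)
    observed = 0.0
    for unit in units:
        m = len(unit)
        if m < 2:
            raise ValueError("each unit needs at least two ratings")
        # Squared differences over ordered pairs within the unit.
        observed += sum((a - b) ** 2 for a in unit for b in unit) / (m - 1)
    observed /= n_total
    # Squared differences over ordered pairs across all ratings.
    expected = sum((a - b) ** 2 for a in values for b in values) / (n_total * (n_total - 1))
    if expected == 0:
        raise ValueError("alpha is undefined when all ratings are identical")
    return 1.0 - observed / expected
```

In this sketch, `spearman_rho` would take per-episode composite scores and the matching Sharpe ratios, while `krippendorff_alpha_interval` would take the judges' scores for one dimension across episodes.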

Abstract

Agentic artificial intelligence systems produce outputs through sequences of interdependent autonomous decisions, yet standard evaluation assesses outputs alone and cannot diagnose the underlying process. We develop a behavioral evaluation methodology that complements output-level testing by scoring the intermediate decision process itself. Behavioral traces logged at each autonomous decision point are grouped into five-day episodes and scored along six domain-specific dimensions (regime detection, routing, adaptation, risk calibration, strategy coherence, error recovery) by an ensemble of three large language model (LLM) judges. A perturbation procedure that corrupts one dimension while leaving the other five intact confirms dimension specificity; cross-model agreement reaches Krippendorff's alpha = 0.85. The composite behavioral score correlates at Spearman rho = 0.72 with realized…
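
The episode grouping and ensemble scoring described in the abstract can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: the trace layout (`(day_index, record)` tuples), the identifier spellings in `DIMENSIONS`, the averaging of judges, and the unweighted mean used for the composite are all hypothetical choices made here.

```python
from statistics import mean

# The six dimensions named in the abstract; identifier spellings are assumed.
DIMENSIONS = (
    "regime_detection",
    "routing",
    "adaptation",
    "risk_calibration",
    "strategy_coherence",
    "error_recovery",
)


def group_into_episodes(traces, episode_days=5):
    """Group (day_index, record) traces into consecutive five-day episodes.

    Several decision points may share one trading day, so grouping uses
    the day index rather than the position in the list.
    """
    episodes = {}
    for day, record in traces:
        episodes.setdefault(day // episode_days, []).append(record)
    return [episodes[key] for key in sorted(episodes)]


def ensemble_scores(judge_scores):
    """Average each dimension over the judges (assumed aggregation rule)."""
    return {d: mean(scores[d] for scores in judge_scores) for d in DIMENSIONS}


def composite_score(dimension_scores):
    """Unweighted mean over the six dimensions (assumed weighting)."""
    return mean(dimension_scores[d] for d in DIMENSIONS)


if __name__ == "__main__":
    traces = [(0, "a"), (0, "b"), (4, "c"), (5, "d"), (11, "e")]
    print(group_into_episodes(traces))

    # Three hypothetical judges scoring one episode on every dimension.
    judges = [
        {d: 4.0 for d in DIMENSIONS},
        {d: 5.0 for d in DIMENSIONS},
        {d: 3.0 for d in DIMENSIONS},
    ]
    per_dimension = ensemble_scores(judges)
    print(per_dimension, composite_score(per_dimension))
```

Keeping the per-dimension scores separate until the final step matters for the perturbation procedure, which checks whether corrupting one dimension moves only that dimension's score.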
