Multi-Dimensional Behavioral Evaluation of Agentic Stock Prediction Systems Using Large Language Model Judges with Closed-Loop Reinforcement Learning Feedback
Mohammad Al Ridhawi, Mahtab Haj Ali, Hussein Al Osman

TL;DR
This paper introduces a behavioral evaluation framework for agentic AI systems, using LLM judges to analyze decision processes and improve stock prediction performance through closed-loop reinforcement learning.
Contribution
It develops a multi-dimensional behavioral scoring method for the intermediate decision process of agentic systems and shows that feeding those scores back through closed-loop reinforcement learning improves stock prediction accuracy.
Findings
The composite behavioral score correlates with the 20-day Sharpe ratio (Spearman rho = 0.72).
Fine-tuning reduces one-day MAPE from 0.61% to 0.54%.
Agreement across the three LLM judges is high (Krippendorff's alpha = 0.85).
Abstract
Agentic artificial intelligence systems produce outputs through sequences of interdependent autonomous decisions, yet standard evaluation assesses outputs alone and cannot diagnose the underlying process. We develop a behavioral evaluation methodology that complements output-level testing by scoring the intermediate decision process itself. Behavioral traces logged at each autonomous decision point are grouped into five-day episodes and scored along six domain-specific dimensions (regime detection, routing, adaptation, risk calibration, strategy coherence, error recovery) by an ensemble of three large language model (LLM) judges. A perturbation procedure that corrupts one dimension while leaving the other five intact confirms dimension specificity; cross-model agreement reaches Krippendorff's alpha = 0.85. The composite behavioral score correlates at Spearman rho = 0.72 with realized…
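As a rough illustration of the scoring pipeline the abstract describes, the sketch below groups daily traces into five-day episodes and combines three judges' scores over the six named dimensions into one composite. The five-day episode length, the six dimension names, and the three-judge ensemble come from the abstract. Everything else is an assumption made for illustration: the function names, the 1-to-5 score scale, averaging across judges, equal weighting of dimensions, and dropping an incomplete trailing episode. The paper's actual aggregation rule may differ.

```python
from statistics import mean

# The six dimensions named in the abstract.
DIMENSIONS = (
    "regime_detection", "routing", "adaptation",
    "risk_calibration", "strategy_coherence", "error_recovery",
)
EPISODE_LENGTH = 5  # days per episode, as stated in the abstract


def group_into_episodes(daily_traces, length=EPISODE_LENGTH):
    """Split a chronological list of daily traces into consecutive full episodes.

    Assumption: a trailing partial episode is dropped.
    """
    full = len(daily_traces) - len(daily_traces) % length
    return [daily_traces[i:i + length] for i in range(0, full, length)]


def ensemble_scores(judge_scores):
    """Average each dimension over the judges (assumed aggregation rule)."""
    return {d: mean(scores[d] for scores in judge_scores) for d in DIMENSIONS}


def composite_score(dimension_scores):
    """Unweighted mean over the six dimensions (assumed weighting)."""
    return mean(dimension_scores[d] for d in DIMENSIONS)


# Hypothetical usage: 12 days of traces and three judges on an assumed 1-5 scale.
traces = [f"day-{i}" for i in range(12)]
episodes = group_into_episodes(traces)
judges = [
    {d: 4 for d in DIMENSIONS},
    {d: 5 for d in DIMENSIONS},
    {d: 3 for d in DIMENSIONS},
]
per_dimension = ensemble_scores(judges)
print(len(episodes), composite_score(per_dimension))
```

Keeping per-dimension scores separate until the final step preserves the diagnostic value of the six dimensions, which a single composite number would otherwise hide.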
