ValueBlindBench: Agreement-Gated Stress Testing of LLM-Judged Investment Rationales Before Returns Are Observable
Sidi Chang, Peiying Zhu, Yuxiao Chen

TL;DR
ValueBlindBench is an agreement-gated protocol for evaluating LLM-generated investment rationales before realized returns are observable, targeting the delayed-ground-truth evaluation problem in AI-finance.
Contribution
It introduces a preregistered, agreement-gated stress-test protocol for pre-deployment evaluation of LLM judges in finance, used to decide when LLM-judged investment-rationale claims are publishable, qualified, or invalid.
Findings
The agreement gate is cleared at 0.7168, but the protocol still blocks overclaims.
Lower-rank systems tend to collapse into a tie-class.
Financial constructs like constraint awareness are operationally load-bearing.
Abstract
LLM-based financial agents increasingly produce investment rationales before the outcomes needed to evaluate them are observable. This creates a delayed-ground-truth evaluation problem: realized returns remain the eventual arbiter of investment quality, but they arrive too late and are too noisy to guide many model-development and governance decisions. LLM judges offer a tempting shortcut for pre-deployment evaluation of AI-finance systems, but unvalidated judges may reward verbosity, confidence, or rubric mimicry rather than financial judgment. This paper introduces ValueBlindBench, a preregistered agreement-gated stress-test protocol for deciding when LLM-judged investment-rationale claims are publishable, qualified, or invalid. In a controlled market-state capital-allocation prototype with 1,000 honest decision cycles and 100 preregistered adversarial controls (1,100 trajectories,…
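To make the idea of an agreement gate concrete, the following is a minimal sketch of a three-way decision (publishable, qualified, or invalid) driven by inter-judge agreement. It is illustrative only: the choice of Cohen's kappa as the agreement statistic, the function names `cohen_kappa` and `gate`, and both thresholds are assumptions made here, not details taken from the paper.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty, equal-length label sequences")
    n = len(labels_a)
    # Fraction of items on which the two judges agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance from each judge's label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        # Both judges used one identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


def gate(agreement, publish_at=0.7, qualify_at=0.4):
    """Map an agreement score to a claim status.

    Both thresholds are hypothetical placeholders, not the paper's values.
    """
    if agreement >= publish_at:
        return "publishable"
    if agreement >= qualify_at:
        return "qualified"
    return "invalid"


# Two hypothetical judges labelling the same six rationales.
judge_1 = ["strong", "strong", "weak", "weak", "strong", "weak"]
judge_2 = ["strong", "strong", "weak", "weak", "weak", "weak"]
status = gate(cohen_kappa(judge_1, judge_2))
```

In a sketch like this, the gate runs before any ranking of systems is reported, so a judge panel with low agreement cannot support a claim no matter how confident its individual scores look.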
