Your Offline Policy is Not Trustworthy: Bilevel Reinforcement Learning for Sequential Portfolio Optimization

Haochen Yuan; Minting Pan; Yunbo Wang; Siyu Gao; Philip S. Yu; Xiaokang Yang

arXiv:2505.12759 · cs.LG · May 20, 2025

TL;DR

This paper introduces MetaTrader, a bilevel reinforcement learning framework for portfolio optimization that enhances out-of-domain performance and addresses value overestimation in offline RL settings.

Contribution

MetaTrader is the first to explicitly train RL agents for both in-domain and out-of-domain stock trading performance using a bilevel learning approach.

Findings

01. MetaTrader outperforms existing RL and traditional models on stock datasets.

02. The bilevel framework improves generalization to data transformations.

03. The new TD method reduces value overestimation in offline RL.

Abstract

Reinforcement learning (RL) has shown significant promise for sequential portfolio optimization tasks, such as stock trading, where the objective is to maximize cumulative returns while minimizing risks using historical data. However, traditional RL approaches often produce policies that merely memorize the optimal yet impractical buying and selling behaviors within the fixed dataset. These offline policies are less generalizable as they fail to account for the non-stationary nature of the market. Our approach, MetaTrader, frames portfolio optimization as a new type of partial-offline RL problem and makes two technical contributions. First, MetaTrader employs a bilevel learning framework that explicitly trains the RL agent to improve both in-domain profits on the original dataset and out-of-domain performance across diverse transformations of the raw financial data. Second, our approach…
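To make the bilevel idea concrete, the following is a minimal toy sketch and not the MetaTrader algorithm itself. It assumes a constant-weight portfolio in place of an RL policy, a first-order approximation of the outer gradient, and two hypothetical data transformations (volatility rescaling and additive noise). All function names, step sizes, and the synthetic returns are illustrative assumptions rather than details from the paper.

```python
import numpy as np

def portfolio_loss(logits, gross_returns):
    """Negative mean log-growth of a constant-weight portfolio."""
    w = np.exp(logits - logits.max())
    w /= w.sum()                      # softmax: positive weights that sum to 1
    growth = gross_returns @ w        # per-step gross portfolio return, shape (T,)
    return -np.mean(np.log(growth))

def num_grad(f, x, eps=1e-5):
    """Central finite-difference gradient (adequate for a tiny toy problem)."""
    g = np.zeros_like(x)
    for i in range(x.size):
        d = np.zeros_like(x)
        d[i] = eps
        g[i] = (f(x + d) - f(x - d)) / (2 * eps)
    return g

def transformed_views(gross_returns, rng, n_views=3):
    """Hypothetical out-of-domain views: rescaled volatility plus small noise."""
    views = []
    for _ in range(n_views):
        scale = rng.uniform(0.5, 1.5)
        noise = rng.normal(0.0, 0.005, size=gross_returns.shape)
        view = 1.0 + scale * (gross_returns - 1.0) + noise
        views.append(np.clip(view, 1e-6, None))  # keep gross returns positive
    return views

def bilevel_train(gross_returns, steps=100, alpha=1.0, beta=2.0, seed=0):
    """Toy bilevel loop; alpha and beta are assumed step sizes."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(gross_returns.shape[1])
    for _ in range(steps):
        # Inner level: one adaptation step on the original (in-domain) data.
        inner = num_grad(lambda t: portfolio_loss(t, gross_returns), theta)
        fast = theta - alpha * inner
        # Outer level: score the adapted parameters on transformed data and
        # apply that gradient to theta (first-order approximation).
        views = transformed_views(gross_returns, rng)
        outer = np.mean(
            [num_grad(lambda t, v=v: portfolio_loss(t, v), fast) for v in views],
            axis=0,
        )
        theta = theta - beta * outer
    return theta

# Synthetic gross returns for three assets with different mean returns.
rng = np.random.default_rng(42)
mu = np.array([0.01, 0.0, -0.01])
demo_returns = 1.0 + mu + 0.01 * rng.standard_normal((250, 3))

theta = bilevel_train(demo_returns)
init_loss = portfolio_loss(np.zeros(3), demo_returns)
final_loss = portfolio_loss(theta, demo_returns)
weights = np.exp(theta - theta.max())
weights /= weights.sum()
```

The design point the sketch tries to capture is that the parameters are updated using a loss measured on transformed data after an in-domain adaptation step, so a solution that only fits the original dataset is not rewarded on its own.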

Taxonomy

Topics: Stock Market Forecasting Methods · Advanced Bandit Algorithms Research · Risk and Portfolio Optimization