Improving and Accelerating Offline RL in Large Discrete Action Spaces with Structured Policy Initialization

Matthew Landers; Taylor W. Killian; Thomas Hartvigsen; Afsaneh Doryab

arXiv:2601.04441 · cs.LG · January 29, 2026

Open Access · 3 Reviews

TL;DR

This paper introduces SPIN, a two-stage framework for offline reinforcement learning in large discrete action spaces, which improves performance and training speed by leveraging structured policy initialization.

Contribution

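The paper proposes SPIN, a two-stage approach that first pre-trains an Action Structure Model (ASM) to capture which sub-action combinations are valid, then freezes that representation and trains lightweight policy heads for control. Separating the two steps improves both return and training speed in large discrete action spaces.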

Findings

01. SPIN improves average return by up to 39% over state-of-the-art methods.

02. SPIN reduces time to convergence by a factor of up to 12.8.

03. SPIN handles combinatorial action spaces effectively in offline RL.

Abstract

Reinforcement learning in discrete combinatorial action spaces requires searching over exponentially many joint actions to simultaneously select multiple sub-actions that form coherent combinations. Existing approaches either simplify policy learning by assuming independence across sub-actions, which often yields incoherent or invalid actions, or attempt to learn action structure and control jointly, which is slow and unstable. We introduce Structured Policy Initialization (SPIN), a two-stage framework that first pre-trains an Action Structure Model (ASM) to capture the manifold of valid actions, then freezes this representation and trains lightweight policy heads for control. On challenging discrete DM Control benchmarks, SPIN improves average return by up to 39% over the state of the art while reducing time to convergence by up to 12.8$\times$.
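The two problems named in the abstract can be made concrete with a toy calculation. The sketch below is not SPIN or any baseline from the paper. It only counts joint actions and measures how much probability a fully factorized (independent) policy would place on combinations that never occur in a dataset. The function names and the two-action toy dataset are hypothetical, chosen for illustration.

```python
from collections import Counter
from itertools import product


def joint_action_count(num_sub_actions, choices_per_sub_action):
    """Number of joint actions when every sub-action has the same number of choices."""
    return choices_per_sub_action ** num_sub_actions


def factorized_mass_outside_data(dataset):
    """Probability that a product-of-marginals policy assigns to joint actions
    absent from `dataset` (a list of equal-length tuples of sub-action choices)."""
    n = len(dataset)
    dims = len(dataset[0])
    # Empirical marginal counts for each sub-action dimension.
    marginals = [Counter(action[d] for action in dataset) for d in range(dims)]
    support = set(dataset)
    mass = 0.0
    # Enumerate every joint action reachable under the independent marginals.
    for joint in product(*(m.keys() for m in marginals)):
        p = 1.0
        for d, choice in enumerate(joint):
            p *= marginals[d][choice] / n
        if joint not in support:
            mass += p
    return mass


# Hypothetical toy data: the two sub-actions always agree.
toy_dataset = [(0, 0), (1, 1)]

print(joint_action_count(6, 3))                   # 3**6 = 729 joint actions
print(factorized_mass_outside_data(toy_dataset))  # 0.5
```

In the toy dataset each marginal is uniform, so the independent policy spreads its mass evenly over all four combinations and half of it lands on (0, 1) and (1, 0), which the data never contains. A model of action structure is meant to rule out such combinations before control is learned.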

Peer Reviews

Decision · ICLR 2026 Poster

Reviewer 01 · Rating 4 · Confidence 3

Strengths

- The method is very simple and intuitive. By doing self-supervised learning on the large action space, one can learn a more meaningful action representation than the original one. This will lead to large downstream gains.
- The paper is well written and easy to understand.
- The experiment section has interesting analysis results to pin down why SPIN is helpful.

Weaknesses

### Empirical evaluation feels toy and contrived

- these methods are all evaluated in rather artificial RL tasks (hopper, quadruped, etc.), where they take a popular benchmark (DM Control) and then factorize the action space. While useful for fast iteration and initial scientific insight, it is insufficient for convincing me that this method, or even the problem of large discrete action space, is useful. The authors motivated the problem by citing natural problems with large action spaces like r

Reviewer 02 · Rating 4 · Confidence 3

Strengths

By separating the learning of action structure and policy, the proposed algorithm overcomes the computational cost issue that a previous work named SAINT has.

Weaknesses

Determining when to finish pretraining and move on to policy training is crucial. Stopping pretraining too early could lead to poor action structure modeling (Sec 6.1 illustrates the importance of sufficient pretraining), and stopping pretraining too late could lead to the same computational cost issue that SAINT has. The authors do not provide an approach to choose the stopping time of pretraining. The proposed method largely reuses the policy architecture in SAINT, and thus the novelty o

Reviewer 03 · Rating 4 · Confidence 3

Strengths

The paper is clearly written and well motivated. The proposed idea is straightforward, and the algorithm is compatible with actor–critic frameworks, which enhances its applicability across a wide range of settings. The experimental results demonstrate the superiority of the proposed approach compared with the three selected baselines.

Weaknesses

While the focus on offline RL is relevant, it is not sufficiently justified in the paper. In particular, SAINT is originally an online approach, which has been used here in an offline setting for comparison. In my view, it is not entirely fair to claim that SAINT jointly learns the action structure and control, as it was designed for a different purpose. This raises questions about the validity of the comparison. The evaluation is also somewhat limited. The implementation details for the select

Taxonomy

Topics: Reinforcement Learning in Robotics · Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning