SAINT: Attention-Based Policies for Discrete Combinatorial Action Spaces
Matthew Landers, Taylor W. Killian, Thomas Hartvigsen, Afsaneh Doryab

TL;DR
SAINT is a novel transformer-based policy architecture that effectively models complex joint dependencies in large combinatorial action spaces, improving reinforcement learning performance across diverse environments.
Contribution
Introduces SAINT, a permutation-invariant transformer architecture for combinatorial actions, capturing dependencies and enhancing sample efficiency in RL.
Findings
Outperforms baselines in 18 environments
Handles up to 1.35 quintillion actions
Models complex joint action dependencies
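To give a sense of scale for the action-count finding, the sketch below computes the size of a joint action space as the product of the number of choices per sub-action. The function name `joint_action_count` and the example of 38 three-way sub-actions are illustrative assumptions, picked only because 3^38 is roughly 1.35 quintillion; the source does not say which environment configuration produces that figure.

```python
from math import prod

def joint_action_count(choices_per_sub_action):
    """Number of joint actions when one choice is made for every sub-action."""
    return prod(choices_per_sub_action)

# Hypothetical configuration (an assumption, not taken from the paper):
# 38 sub-actions with 3 choices each.
example = joint_action_count([3] * 38)
print(f"{example:.3e}")
```

Enumerating a space of this size explicitly is infeasible, which is why policies for such spaces operate on sub-actions rather than on the flat joint action.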
Abstract
The combinatorial structure of many real-world action spaces leads to exponential growth in the number of possible actions, limiting the effectiveness of conventional reinforcement learning algorithms. Recent approaches for combinatorial action spaces impose factorized or sequential structures over sub-actions, failing to capture complex joint behavior. We introduce the Sub-Action Interaction Network using Transformers (SAINT), a novel policy architecture that represents multi-component actions as unordered sets and models their dependencies via self-attention conditioned on the global state. SAINT is permutation-invariant, sample-efficient, and compatible with standard policy optimization algorithms. In 18 distinct combinatorial environments across three task domains, including environments with as many as 1.35 quintillion possible actions, SAINT consistently outperforms strong baselines.
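The idea in the abstract of treating sub-actions as an unordered set and relating them through state-conditioned self-attention can be sketched in plain Python. This is a minimal illustration, not the authors' implementation. The single attention head, the additive state conditioning, the random weights, and all names (`sub_action_logits`, `params`, `state_vec`) are assumptions made for the example. Because no positional encoding is used, reordering the sub-actions simply reorders the output rows.

```python
import math
import random

def matmul(a, b):
    """Multiply matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention, no positional encoding."""
    q, k, v = matmul(tokens, w_q), matmul(tokens, w_k), matmul(tokens, w_v)
    k_t = [list(col) for col in zip(*k)]  # transpose of k
    scale = math.sqrt(len(w_q[0]))
    weights = [softmax([s / scale for s in row]) for row in matmul(q, k_t)]
    return matmul(weights, v)

def sub_action_logits(embeddings, state_vec, params):
    """Return one row of logits per sub-action.

    Assumption: the state is injected by adding an already-projected state
    vector to every sub-action embedding; the paper's mechanism may differ.
    """
    tokens = [[e + s for e, s in zip(emb, state_vec)] for emb in embeddings]
    hidden = self_attention(tokens, params["w_q"], params["w_k"], params["w_v"])
    return matmul(hidden, params["w_out"])

def random_matrix(rows, cols, rng):
    """Random weights stand in for learned parameters in this sketch."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
d_model, n_sub_actions, n_choices = 4, 3, 3  # illustrative sizes
params = {
    "w_q": random_matrix(d_model, d_model, rng),
    "w_k": random_matrix(d_model, d_model, rng),
    "w_v": random_matrix(d_model, d_model, rng),
    "w_out": random_matrix(d_model, n_choices, rng),
}
embeddings = random_matrix(n_sub_actions, d_model, rng)
state_vec = [rng.uniform(-1.0, 1.0) for _ in range(d_model)]

logits = sub_action_logits(embeddings, state_vec, params)

# Reorder the sub-actions and recompute: each output row should move with its input.
perm = [2, 0, 1]
permuted_logits = sub_action_logits([embeddings[i] for i in perm], state_vec, params)
rows_follow_inputs = all(
    math.isclose(permuted_logits[j][c], logits[perm[j]][c], abs_tol=1e-9)
    for j in range(n_sub_actions)
    for c in range(n_choices)
)
print(rows_follow_inputs)
```

Each row of `logits` depends on every sub-action embedding through the attention weights, which is how a model of this kind can express dependencies that a fully factorized policy cannot. The final check compares the permuted and original outputs up to floating-point tolerance.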
Peer Reviews
Decision: Submitted to ICLR 2026
The authors provide ablations showing that the proposed method is robust to varying dimensionality and varying sub-action dependence.
The proposed method can have high computational costs. When the action space is large, the learnable embedding vector e_i has high dimension. Adding state conditioning further increases the dimensionality.
- The proposed approach can model complex, context-sensitive dependencies in large action spaces. It is permutation invariant, i.e., naturally fits unordered action compositions.
- The evaluation conducted is extensive and compelling. I appreciate the ablations. Experiments show that the proposed approach consistently outperforms baselines on diverse tasks: state-independent (traffic control), state-dependent (navigation), and weakly dependent (discretized MuJoCo). The scalability of the appro
- The approach may be less justified for low-dimensional or weakly structured domains.
Suggestions:
- Since combinatorial action spaces are common in offline RL (e.g., healthcare), systematic analysis in off-policy contexts could further establish SAINT's utility.
1. **Clarity:** The paper is written with outstanding clarity, making the problem, prior work, and the proposed method very easy to understand.
2. **Problem Formulation:** The authors correctly identify a key limitation of existing approaches, namely the rigid and often incorrect inductive bias of a fixed autoregressive ordering.
3. **Architectural Fit:** The idea of using a permutation-equivariant architecture is an elegant and principled solution for the *specific class of problems* where s
1. **Incremental Novelty:** The technical contribution is thin. The method consists of a known neural architecture (self-attention on an unordered set) plugged into a standard, on-policy algorithm (PPO). This is an exercise in architectural engineering, not a new method, and its novelty is limited.
2. **Fundamentally Questionable Inductive Bias:** The paper's entire motivation rests on the assumption that permutation-equivariance is a *universally desirable* property. This is a strong and, in
Taxonomy
Topics: Reinforcement Learning in Robotics · Adversarial Robustness in Machine Learning · Explainable Artificial Intelligence (XAI)
Methods: Dense Connections · Feedforward Network · CutMix · Mixup · SAINT
