Action-Free Offline-to-Online RL via Discretised State Policies

Natinael Solomon Neggatu; Jeremie Houssineau; Giovanni Montana

arXiv:2602.00629·stat.ML·February 3, 2026

Open Access 3 Reviews

TL;DR

This paper introduces a new offline-to-online reinforcement learning framework that learns from datasets without actions by focusing on state transitions, using discretisation and pre-trained state policies to improve online learning efficiency.

Contribution

It proposes a novel state discretisation transformation and a value-based algorithm, Offline State-Only DecQN, for pre-training from action-free data, and introduces guided online learning leveraging these policies.

Findings

1. Improves convergence speed in online RL tasks.
2. Enhances asymptotic performance with action-free pre-training.
3. Discretisation and regularisation are key to effectiveness.

Abstract

Most existing offline RL methods presume the availability of action labels within the dataset, but in many practical scenarios, actions may be missing due to privacy, storage, or sensor limitations. We formalise the setting of action-free offline-to-online RL, where agents must learn from datasets consisting solely of $(s, r, s')$ tuples and later leverage this knowledge during online interaction. To address this challenge, we propose learning state policies that recommend desirable next-state transitions rather than actions. Our contributions are twofold. First, we introduce a simple yet novel state discretisation transformation and propose Offline State-Only DecQN, a value-based algorithm designed to pre-train state policies from action-free data. Offline State-Only DecQN integrates the transformation to scale efficiently to high-dimensional problems while avoiding instability and overfitting…
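To make the idea of a state discretisation transformation concrete, here is a minimal sketch in Python. The abstract does not specify the transformation, so everything here is an assumption for illustration: the three-way binning of per-dimension state changes, the `threshold` parameter, and the function name `discretise_transitions` are all hypothetical and may differ from what the paper does.

```python
import numpy as np

def discretise_transitions(states, next_states, threshold=1e-3):
    """Map each state dimension's change to a token in {0, 1, 2}.

    Hypothetical scheme (not taken from the paper):
      0 = the dimension decreased by more than `threshold`
      1 = the dimension stayed roughly unchanged
      2 = the dimension increased by more than `threshold`
    """
    delta = np.asarray(next_states, dtype=float) - np.asarray(states, dtype=float)
    tokens = np.ones(delta.shape, dtype=np.int64)  # default: unchanged
    tokens[delta < -threshold] = 0
    tokens[delta > threshold] = 2
    return tokens

# One (s, s') pair from an action-free dataset with a 3-dimensional state.
states = np.array([[0.0, 1.0, -0.5]])
next_states = np.array([[0.2, 1.0, -0.9]])
tokens = discretise_transitions(states, next_states)
print(tokens)  # [[2 1 0]]
```

Under this assumed scheme, a state policy only has to choose one of three tokens per state dimension, so the number of outputs grows linearly with the state dimension rather than exponentially.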

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 4 · Confidence 4

Strengths

The problem setting is interesting: action-free datasets would seem to be more common in practice, yet they are understudied. Discretizing the state space and converting the result to actions using an inverse dynamics model (IDM) is simple and surprisingly effective.

Weaknesses

Generally, there are no glaring issues but some experimental and design choices are unclear to me. I have included these questions below in the "Questions" section. Many of these are clarification questions or ablation suggestions and I am happy to increase my score after having more information.

Reviewer 02 · Rating 6 · Confidence 3

Strengths

1. The motivation for action-free RL is strong and is clearly explained in the paper.
2. The authors propose an algorithm with novel technical designs.
3. The experimental results are generally good. The ablation study is helpful.

Weaknesses

1. The paper directly decouples the Q function. When the real Q function depends on the correlation between different dimensions, can this approximation perform well? Also, for the argmax over actions, what is the choice when $\mathcal{A}$ is not a product space? For instance, if $\mathcal{A}$ is a unit ball, how is the argmax defined on each dimension?
2. For the training loss of the IDM, could you please provide some motivation? Does the success of s
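The decoupling question can be illustrated with a small sketch. It assumes a DecQN-style decomposition in which the joint value is the mean of per-dimension utilities; whether the paper uses exactly this form is an assumption, and the names `decoupled_q` and `greedy_choice` and the utility table `U` are hypothetical.

```python
import itertools
import numpy as np

def decoupled_q(utilities, choice):
    """Joint value as the mean of per-dimension utilities U_i(s, k_i).

    utilities: array of shape (M, K), one row per dimension, one column per bin.
    choice:    length-M sequence of bin indices, one per dimension.
    """
    utilities = np.asarray(utilities, dtype=float)
    rows = np.arange(utilities.shape[0])
    return utilities[rows, np.asarray(choice)].mean()

def greedy_choice(utilities):
    """Per-dimension argmax over the bins."""
    return np.asarray(utilities).argmax(axis=1)

# Hypothetical utilities for M = 2 dimensions and K = 3 bins in a fixed state.
U = np.array([[0.1, 0.5, 0.2],
              [0.9, 0.3, 0.0]])

best = greedy_choice(U)
q_best = decoupled_q(U, best)

# Brute-force check over all K**M joint choices.
joint_best = max(itertools.product(range(3), repeat=2),
                 key=lambda c: decoupled_q(U, c))
print(best, q_best, joint_best)
```

Because the mean is separable, the per-dimension argmax coincides with the joint argmax over the product of bins. The same separability means the decomposition cannot represent interactions between dimensions, and it presupposes a product-structured choice set.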

Reviewer 03 · Rating 4 · Confidence 3

Strengths

1. The proposed method is conceptually clear and technically sound, addressing a problem where offline datasets lack action labels.
2. The methodology is well-designed, combining existing offline RL techniques with novel discretization and regularization components in a coherent way.
3. The ablation studies are comprehensive and clearly demonstrate the contribution of each component.

Weaknesses

1. The motivation for the action-free offline data setting is not clearly justified. The paper should better explain when and why such data would realistically occur.
2. The examples in the introduction are not entirely convincing. The most plausible application would be robotics from video, but the experiments only consider relatively small state spaces (up to 78 dimensions), which limits the realism of the claim.
3. The baseline coverage is limited: the paper primarily compares against a sing

Taxonomy

Topics: Reinforcement Learning in Robotics · Domain Adaptation and Few-Shot Learning · Human Pose and Action Recognition