Offline Reinforcement Learning of High-Quality Behaviors Under Robust Style Alignment
Mathieu Petitbois, Rémy Portelas, Sylvain Lamprier

TL;DR
This paper introduces SCIQL, an offline reinforcement learning framework that balances style alignment against task performance by pairing a unified definition of behavior style with a Gated Advantage Weighted Regression mechanism.
Contribution
It proposes a unified definition of behavior style and the SCIQL algorithm, which combines offline goal-conditioned RL techniques with a Gated Advantage Weighted Regression (GAWR) mechanism to improve task performance while preserving style alignment.
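As a rough illustration of what style supervision through a sub-trajectory labeling function can look like, the sketch below labels every fixed-length window of a 1-D velocity trace as slow, medium, or fast. The function name, the window length, the thresholds, and the speed-based rule are all assumptions made for this example; they are not taken from the paper.

```python
import numpy as np

def label_subtrajectories(velocities, window=4, slow_max=1.0, fast_min=3.0):
    """Assign a style label to every length-`window` sub-trajectory.

    Hypothetical rule (an assumption, not the paper's): the mean absolute
    speed over the window decides the label.
    Returns an int array with 0 = slow, 1 = medium, 2 = fast.
    """
    velocities = np.asarray(velocities, dtype=float)
    n_windows = len(velocities) - window + 1
    if n_windows <= 0:
        return np.empty(0, dtype=int)
    # Sliding-window mean speed computed from a cumulative sum.
    csum = np.concatenate(([0.0], np.cumsum(np.abs(velocities))))
    mean_speed = (csum[window:] - csum[:-window]) / window
    labels = np.ones(n_windows, dtype=int)   # default: medium
    labels[mean_speed <= slow_max] = 0       # slow windows
    labels[mean_speed >= fast_min] = 2       # fast windows
    return labels

# Usage: a trace that starts slow and ends fast.
trace = [0.5] * 4 + [4.0] * 4
print(label_subtrajectories(trace))
```

Labels of this kind could then serve as the conditioning signal for a style-conditioned policy, in the same way goals are used in goal-conditioned RL.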
Findings
SCIQL outperforms prior offline methods on both style-alignment and task-performance metrics.
The framework preserves style alignment while maintaining high task performance.
The experimental results support these gains across the benchmarks tested.
Abstract
We study offline reinforcement learning of style-conditioned policies using explicit style supervision via subtrajectory labeling functions. In this setting, aligning style with high task performance is particularly challenging due to distribution shift and inherent conflicts between style and reward. Existing methods, despite introducing numerous definitions of style, often fail to reconcile these objectives effectively. To address these challenges, we propose a unified definition of behavior style and instantiate it into a practical framework. Building on this, we introduce Style-Conditioned Implicit Q-Learning (SCIQL), which leverages offline goal-conditioned RL techniques, such as hindsight relabeling and value learning, and combines them with a new Gated Advantage Weighted Regression mechanism to efficiently optimize task performance while preserving style alignment. Experiments…
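For readers unfamiliar with the building blocks named in the abstract, the sketch below shows two standard ingredients in NumPy: the expectile loss that Implicit Q-Learning uses to fit a value function, and plain advantage-weighted regression weights. The function names, default values, and the clipping constant are assumptions for this example. The gating that distinguishes GAWR is not specified in the text above, so it is deliberately not reproduced; this is ordinary, ungated AWR.

```python
import numpy as np

def expectile_loss(q_values, v_values, tau=0.7):
    """Asymmetric squared loss that pulls V toward the tau-expectile of Q.

    With tau > 0.5, errors where Q exceeds V are penalized more heavily
    than errors where V exceeds Q.
    """
    diff = np.asarray(q_values, dtype=float) - np.asarray(v_values, dtype=float)
    weight = np.where(diff < 0, 1.0 - tau, tau)
    return float(np.mean(weight * diff ** 2))

def awr_weights(advantages, temperature=1.0, max_weight=100.0):
    """Plain (ungated) AWR weights: exp(A / temperature), capped at max_weight.

    The cap is applied in log space so that large advantages do not overflow.
    """
    scaled = np.asarray(advantages, dtype=float) / temperature
    return np.exp(np.minimum(scaled, np.log(max_weight)))

# Usage: a lower temperature sharpens the weighting toward high-advantage actions.
advantages = np.array([-1.0, 0.0, 1.0])
print(awr_weights(advantages, temperature=1.0))
print(awr_weights(advantages, temperature=0.5))
```

In a policy update these weights would multiply the log-likelihood of dataset actions, so the temperature directly controls how strongly the policy departs from behavior cloning.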
Peer Reviews
Decision·Submitted to ICLR 2026
This paper provides a unified formulation of behavioral style learning via programmatic sub-trajectory labeling, and introduces the SCIQL+GAWR framework that effectively balances style alignment and task performance in the offline RL setting.
The reliance on hand-crafted style labeling functions constrains scalability to more abstract or subtle styles, and may require domain expertise when applied to complex environments. The algorithmic pipeline is relatively intricate, increasing implementation burden, and evidence on large-scale real-world or high-dimensional robotic systems remains limited.
The proposed GAWR mechanism and sub-trajectory labeling provide a simple yet effective way to integrate style supervision into offline RL. Empirical results on Circle2D and HalfCheetah environments show that SCIQL consistently achieves higher style alignment scores compared to the baselines.
1. The problem formulation is conceptually unclear. If style alignment and task reward are inherently conflicting, the objective should be to balance the trade-off between the two. However, the current formulation seems to sacrifice task reward to increase style conformity, which raises the question of whether this trade-off is explicitly modeled. 2. Given that style alignment and task reward clearly conflict as shown in Section 5.3, the evaluation might be better framed in a Pareto optimality con
* The proposed solution is quite simple and sound. * The effectiveness of the proposed method is clear.
* I think the presentation can be improved if the authors moved some of the plots in the appendix to the main paper. * Some details in the method can be better explained. * I find the need to tune the temperature parameter and its sensitivity a downside of the proposed method.
Taxonomy
Topics: Reinforcement Learning in Robotics · Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis
