FlashSAC: Fast and Stable Off-Policy Reinforcement Learning for High-Dimensional Robot Control

Donghu Kim; Youngdo Lee; Minho Park; Kinam Kim; I Made Aswin Nahendra; Takuma Seno; Sehee Min; Daniel Palenicek; Florian Vogt; Danica Kragic; Jan Peters; Jaegul Choo; Hojoon Lee

arXiv:2604.04539·cs.LG·May 18, 2026

TL;DR

FlashSAC is an off-policy reinforcement learning algorithm built on Soft Actor-Critic. It targets fast, stable training for high-dimensional robot control by performing fewer gradient updates, scaling up model size and data throughput, and limiting the accumulation of critic error.

Contribution

The paper introduces FlashSAC, which sharply reduces the number of gradient updates and explicitly bounds norms to improve stability and efficiency in high-dimensional RL tasks.
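As a rough illustration of what "bounding a norm" can mean, the sketch below rescales a vector so its L2 norm never exceeds a chosen limit, the same operation used in gradient-norm clipping. This summary does not say which norms FlashSAC bounds or how, so the helper names `l2_norm` and `bound_norm` and the limit `max_norm` are assumptions made for illustration, not the authors' method.

```python
import math


def l2_norm(vec):
    """Euclidean (L2) norm of a list of floats."""
    return math.sqrt(sum(x * x for x in vec))


def bound_norm(vec, max_norm):
    """Return a copy of vec rescaled so that its L2 norm is at most max_norm.

    Hypothetical helper: vectors already inside the bound are returned unchanged.
    """
    norm = l2_norm(vec)
    if norm <= max_norm or norm == 0.0:
        return list(vec)
    scale = max_norm / norm
    return [x * scale for x in vec]


# A gradient with L2 norm 5 is shrunk onto the unit ball; its direction is kept.
grad = [3.0, 4.0]
clipped = bound_norm(grad, max_norm=1.0)
```

Rescaling keeps the direction of the update and only limits its size, which is why this kind of bound is a common stabilizer when a single large update could otherwise destabilize the critic.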

Findings

01. Outperforms PPO and other baselines on 60+ tasks in simulation.

02. Reduces training time from hours to minutes in sim-to-real humanoid locomotion.

03. Achieves superior performance and efficiency, especially on high-dimensional tasks.

Abstract

Reinforcement learning (RL) is a core approach for robot control when expert demonstrations are unavailable. On-policy methods such as Proximal Policy Optimization (PPO) are widely used for their stability, but their reliance on narrowly distributed on-policy data limits accurate policy evaluation in high-dimensional state and action spaces. Off-policy methods can overcome this limitation by learning from a broader state-action distribution, yet suffer from slow convergence and instability, as fitting a value function over diverse data requires many gradient updates, causing critic errors to accumulate through bootstrapping. We present FlashSAC, a fast and stable off-policy RL algorithm built on Soft Actor-Critic. Motivated by scaling laws observed in supervised learning, FlashSAC sharply reduces gradient updates while compensating with larger models and higher data throughput. To…
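The abstract's point that critic errors accumulate through bootstrapping can be illustrated with the standard Soft Actor-Critic target, in which the critic regresses toward a value computed from its own next-state estimate. The sketch below is textbook SAC, not FlashSAC-specific code; the function name `soft_td_target` and the default `gamma` and `alpha` values are assumptions chosen for illustration.

```python
def soft_td_target(reward, done, next_q1, next_q2, next_log_prob,
                   gamma=0.99, alpha=0.2):
    """Standard SAC target: r + gamma * (1 - done) * (min(Q1', Q2') - alpha * log pi(a'|s')).

    next_q1 and next_q2 are the two target-critic estimates at the next state,
    and next_log_prob is the log-probability of the sampled next action.
    """
    next_q = min(next_q1, next_q2)                # clipped double-Q estimate
    soft_value = next_q - alpha * next_log_prob   # entropy-regularized value
    return reward + gamma * (1.0 - done) * soft_value


# The same transition, once with accurate next-state estimates and once with
# both critics overestimating by 1.0.
clean = soft_td_target(1.0, 0.0, 10.0, 10.5, -1.0)
biased = soft_td_target(1.0, 0.0, 11.0, 11.5, -1.0)
error_in_target = biased - clean  # the critic's own error, scaled by gamma
```

Because the target reuses the critic's estimate, an error at the next state reappears in the regression target discounted only by `gamma`. Every additional gradient update on such targets is a chance for that error to propagate, which is the motivation the abstract gives for reducing the number of updates.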

Code & Models

Repositories

holiday-robot/FlashSAC (GitHub)
