RFS: Reinforcement Learning with Residual Flow Steering for Dexterous Manipulation
Entong Su, Tyler Westenbroek, Anusha Nagabandi, Abhishek Gupta

TL;DR
This paper introduces RFS, a reinforcement learning framework that efficiently adapts pretrained generative policies for dexterous manipulation by combining residual corrections and latent-space exploration.
Contribution
RFS is a novel method that enables rapid, data-efficient adaptation of pretrained flow-matching policies through residual steering and latent modulation.
Findings
Effective fine-tuning in simulation and real-world tasks
Preserves the expressive structure of pretrained policies
Enhances exploration during policy adaptation
Abstract
Imitation learning has emerged as an effective approach for bootstrapping sequential decision-making in robotics, achieving strong performance even in high-dimensional dexterous manipulation tasks. Recent behavior cloning methods further leverage expressive generative models, such as diffusion models and flow matching, to represent multimodal action distributions. However, policies pretrained in this manner often exhibit limited generalization and require additional fine-tuning to achieve robust performance at deployment time. Such adaptation must preserve the global exploration benefits of pretraining while enabling rapid correction of local execution errors. We propose Residual Flow Steering (RFS), a data-efficient reinforcement learning framework for adapting pretrained generative policies. RFS steers a pretrained flow-matching policy by jointly optimizing a residual action and a…
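The two modulation channels named in the summary (a residual added to the output action, and steering of the latent noise fed into the flow policy) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not the authors' implementation: `base_velocity` is a toy stand-in for a pretrained flow-matching velocity field, `modulation_policy` is a hypothetical deterministic linear head, and the parameter names (`W_noise`, `W_res`, `residual_scale`) are invented. In an actual RL setup the modulation policy would be learned and typically stochastic.

```python
import numpy as np


def base_velocity(obs, action, t):
    """Toy stand-in (assumption) for a pretrained flow-matching velocity field."""
    # Pull the sample toward an observation-dependent target action.
    target = np.tanh(obs[: action.shape[0]])
    return target - action


def base_policy(obs, noise, num_steps=10):
    """Map latent noise to an action by Euler-integrating the velocity field."""
    action = noise.copy()
    dt = 1.0 / num_steps
    for k in range(num_steps):
        action = action + dt * base_velocity(obs, action, k * dt)
    return action


def modulation_policy(obs, params):
    """Hypothetical head returning (steered latent noise, residual action)."""
    # Input modulation: choose the latent noise instead of sampling it.
    noise = params["W_noise"] @ obs
    # Output modulation: a small, bounded correction to the final action.
    residual = params["residual_scale"] * np.tanh(params["W_res"] @ obs)
    return noise, residual


def rfs_action(obs, params):
    """Combine both channels: base policy on steered noise, plus the residual."""
    noise, residual = modulation_policy(obs, params)
    return base_policy(obs, noise) + residual


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    obs = rng.normal(size=6)  # hypothetical 6-D observation
    params = {
        "W_noise": rng.normal(size=(3, 6)),  # hypothetical 3-D action space
        "W_res": rng.normal(size=(3, 6)),
        "residual_scale": 0.1,
    }
    print(rfs_action(obs, params))
```

Bounding the residual with `residual_scale` reflects the division of labour described in the abstract: the steered noise selects among the behaviors the pretrained policy already represents, while the residual is limited to local corrections.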
Peer Reviews
Decision · ICLR 2026 Poster
RFS is a clean unification of residual RL (output modulation) and latent steering (input modulation), formalized via a modulation policy. Strong, consistent improvements over baselines and over residual‑only / steering‑only ablations in simulation and better real‑world success vs. zero‑shot and supervised fine‑tuning. The paper is well written. The introduction and the method are easy to follow and communicate the contributions and the implementation well.
Limited novelty: the algorithm mainly combines two widely used adaptation strategies (residual action learning and latent steering). Baseline coverage: real‑world evaluation lacks comparisons to other RL fine‑tuning approaches for diffusion/flow policies (e.g., recent flow‑RL fine‑tuners); most comparisons are ablations or action‑space baselines. Task scope: validation centers on grasping; the paper claims broader applicability, but no additional manipulation tasks are shown.
1. The paper integrates two previous methods (flow steering, residual learning) with complementary benefits and drawbacks to get the best features of both. 2. The framework the paper introduces is broad enough to be applicable to many important reinforcement learning applications, even outside the domain of manipulation, especially given the wide adoption of generative policies in various RL applications.
1. **Some baseline choices lack motivation/clarity:** I do not understand the use of the VQ-VAE and PCA baselines in Section 6.1.2/Table 1/Fig. 4. From my understanding, both are methods for obtaining different state-action representations, whereas the focus of the paper is fine-tuning. If my understanding is correct, these results compare PPO fine-tuned with RFS against PPO trained on two separate state-action representations (produced by PCA and VQ-VAE). Why is such a comparison meaningful? 2. **Seeds f
- The unification between residual RL and latent-noise steering is interesting, and can open up avenues for various choices of $f$ and $g$. - In the offline RL setting, RFS appears to generalize to unknown objects better in the real-life setting. The extra robustness experiments are helpful in demonstrating RFS' benefits.
- In section 5.2, the paper proposes to collect extra human correction data $a$---this seems to be a strong limitation, similar to applying the DAgger algorithm. It is possible that I have totally misunderstood this process: - The whole trajectory $((o_1, s_1), (o_2, s_2), \dots)$ is generated using the correction action $a$, as opposed to below. - First sample the trajectory $((o_1, s_1), (o_2, s_2), \dots)$ using the base policy actions, then obtain the correction actions based on the alread
Taxonomy
Topics: Reinforcement Learning in Robotics · Robot Manipulation and Learning · Generative Adversarial Networks and Image Synthesis
