RFS: Reinforcement Learning with Residual Flow Steering for Dexterous Manipulation

Entong Su; Tyler Westenbroek; Anusha Nagabandi; Abhishek Gupta

arXiv:2602.01789·cs.RO·February 6, 2026

Open Access · 3 Reviews

TL;DR

This paper introduces RFS, a reinforcement learning framework that efficiently adapts pretrained generative policies for dexterous manipulation by combining residual corrections and latent-space exploration.

Contribution

RFS is a novel method that enables rapid, data-efficient adaptation of pretrained flow-matching policies through residual steering and latent modulation.
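To make the two modulation paths concrete, here is a minimal sketch, assuming a toy stand-in for the pretrained policy. It is not the authors' implementation: `base_flow_policy`, `modulation_policy`, `steered_action`, the linear parameterization, and `residual_scale` are all hypothetical names and choices introduced for illustration. The frozen base policy maps a latent `z` to an action by Euler-integrating a velocity field; the modulation policy picks `z` (input-side steering) and adds a small residual to the output (output-side correction).

```python
import numpy as np

ACT_DIM, OBS_DIM = 3, 5  # hypothetical sizes for this toy example


def base_flow_policy(obs, z, n_steps=10):
    """Toy frozen 'pretrained' policy (assumption, not the paper's model).

    Euler-integrates da/dt = c(obs) - a from a(0) = z, so the final
    action depends on both the observation and the latent z.
    """
    c = np.tanh(obs[:ACT_DIM])
    a = np.asarray(z, dtype=float).copy()
    dt = 1.0 / n_steps
    for _ in range(n_steps):
        a = a + dt * (c - a)
    return a


def modulation_policy(obs, params):
    """Hypothetical linear modulation policy: obs -> (latent z, residual)."""
    out = params @ obs  # shape (2 * ACT_DIM,)
    return out[:ACT_DIM], out[ACT_DIM:]


def steered_action(obs, params, residual_scale=0.1):
    """Steer the frozen base policy through its latent, then add a residual."""
    z, residual = modulation_policy(obs, params)
    return base_flow_policy(obs, z) + residual_scale * residual


rng = np.random.default_rng(0)
obs = rng.normal(size=OBS_DIM)
params = rng.normal(size=(2 * ACT_DIM, OBS_DIM))
action = steered_action(obs, params)
```

In an RL setting, only `params` would be updated while the base policy stays frozen. Keeping `residual_scale` small is one plausible way to limit the residual to local corrections, though the actual scaling used in the paper is not stated on this page.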

Findings

1. Effective fine-tuning in simulation and real-world tasks
2. Preserves the expressive structure of pretrained policies
3. Enhances exploration during policy adaptation

Abstract

Imitation learning has emerged as an effective approach for bootstrapping sequential decision-making in robotics, achieving strong performance even in high-dimensional dexterous manipulation tasks. Recent behavior cloning methods further leverage expressive generative models, such as diffusion models and flow matching, to represent multimodal action distributions. However, policies pretrained in this manner often exhibit limited generalization and require additional fine-tuning to achieve robust performance at deployment time. Such adaptation must preserve the global exploration benefits of pretraining while enabling rapid correction of local execution errors. We propose Residual Flow Steering (RFS), a data-efficient reinforcement learning framework for adapting pretrained generative policies. RFS steers a pretrained flow-matching policy by jointly optimizing a residual action and a…

Peer Reviews

Decision · ICLR 2026 Poster

Reviewer 01 · Rating 2 · Confidence 3

Strengths

RFS is a clean unification of residual RL (output modulation) and latent steering (input modulation), formalized via a modulation policy. Strong, consistent improvements over baselines and over residual‑only / steering‑only ablations in simulation and better real‑world success vs. zero‑shot and supervised fine‑tuning. The paper is well written. The introduction and the method are easy to follow and communicate the contributions and the implementation well.

Weaknesses

- Limited novelty. The algorithm mainly combines two widely used adaptation strategies (residual action learning and latent steering).
- Baseline coverage. Real-world evaluation lacks comparisons to other RL fine-tuning approaches for diffusion/flow policies (e.g., recent flow-RL fine-tuners); most comparisons are ablations or action-space baselines.
- Task scope. Validation centers on grasping; the paper claims broader applicability, but no additional manipulation tasks are shown.

Reviewer 02 · Rating 6 · Confidence 3

Strengths

1. The paper integrates two previous methods (flow steering, residual learning) with complementary benefits and drawbacks to get the best features of both.
2. The framework the paper introduces is broad enough to be applicable to many important reinforcement learning applications, even outside the domain of manipulation, especially given the wide adoption of generative policies in various RL applications.

Weaknesses

1. **Some baseline choices lack motivation/clarity:** I do not understand the use of the VQ-VAE and PCA baselines in Section 6.1.2/Table 1/Fig. 4. From my understanding, both are methods for obtaining different state-action representations, whereas the focus of the paper is fine-tuning. If my understanding is correct, these results compare PPO fine-tuned with RFS against PPO trained on two separate state-action representations (produced by PCA and VQ-VAE). Why is such a comparison meaningful?
2. **Seeds f

Reviewer 03 · Rating 4 · Confidence 2

Strengths

- The unification between residual RL and latent-noise steering is interesting, and can open up avenues for various choices of $f$ and $g$.
- In the offline RL setting, RFS appears to generalize to unknown objects better in the real-life setting. The extra robustness experiments are helpful in demonstrating RFS's benefits.

Weaknesses

- In Section 5.2, the paper proposes to collect extra human correction data $a$; this seems to be a strong limitation, similar to applying the DAgger algorithm. It is possible that I have totally misunderstood this process:
  - The whole trajectory $((o_1, s_1), (o_2, s_2), \dots)$ is generated using the correction action $a$, as opposed to below.
  - First sample the trajectory $((o_1, s_1), (o_2, s_2), \dots)$ using the base policy actions, then obtain the correction actions based on the alread

Taxonomy

Topics: Reinforcement Learning in Robotics · Robot Manipulation and Learning · Generative Adversarial Networks and Image Synthesis