Primary-Fine Decoupling for Action Generation in Robotic Imitation

Xiaohan Lei; Min Wang; Wengang Zhou; Xingyu Lu; Houqiang Li

arXiv:2602.21684·cs.RO·February 26, 2026

Open Access · 3 Reviews

TL;DR

This paper introduces PF-DAG, a two-stage framework that improves robotic action generation by decoupling coarse mode selection from fine-grained continuous action synthesis, leading to more stable and accurate imitation learning.

Contribution

The paper proposes a novel two-stage decoupling approach for multi-modal action generation in robotics, combining discrete mode selection with continuous action refinement, and provides theoretical and empirical validation.

Findings

01

PF-DAG achieves lower MSE bounds than single-stage policies.

02

Outperforms state-of-the-art methods across 56 manipulation tasks.

03

Successfully generalizes to real-world tactile dexterous manipulation.
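The first finding can be related to a standard identity. This is offered as a hedged illustration and is not claimed to be the paper's own derivation. The notation is assumed here: a is the action, s the observation, and m a discrete mode. The law of total variance then reads:

```latex
\operatorname{Var}(a \mid s)
  = \mathbb{E}_{m \mid s}\!\left[ \operatorname{Var}(a \mid s, m) \right]
  + \operatorname{Var}_{m \mid s}\!\left( \mathbb{E}[a \mid s, m] \right)
```

The left side is the smallest MSE any single deterministic prediction from s can reach. The first term on the right is the smallest MSE when the mode is also known, so it never exceeds the left side. The gap is the between-mode spread, and this reading assumes the mode is selected correctly. For vector-valued actions, read each variance as the trace of the corresponding covariance.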

Abstract

Multi-modal distribution in robotic manipulation action sequences poses critical challenges for imitation learning. To this end, existing approaches often model the action space as either a discrete set of tokens or a continuous, latent-variable distribution. However, both approaches present trade-offs: methods that discretize actions into tokens lose fine-grained action variations, while methods that generate continuous actions in a single stage tend to produce unstable mode transitions. To address these limitations, we propose Primary-Fine Decoupling for Action Generation (PF-DAG), a two-stage framework that decouples coarse action consistency from fine-grained variations. First, we compress action chunks into a small set of discrete modes, enabling a lightweight policy to select consistent coarse modes and avoid mode bouncing. Second, a mode-conditioned MeanFlow policy is…
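The two stages described in the abstract can be illustrated with a toy sketch. It is not the authors' implementation: plain k-means stands in for the learned action tokenizer, a greedy most-frequent-mode rule stands in for the primary policy, and a Gaussian residual stands in for the mode-conditioned MeanFlow policy. All names (`fit_modes`, `generate`, `H`) and the synthetic two-mode data are hypothetical.

```python
import numpy as np

def fit_modes(chunks, k, iters=10):
    """Cluster action chunks into k coarse modes.

    Assumption: k-means replaces the paper's learned tokenizer.
    """
    # Deterministic farthest-point initialization of the k centers.
    centers = [chunks[0]]
    for _ in range(1, k):
        d = np.min([((chunks - c) ** 2).sum(-1) for c in centers], axis=0)
        centers.append(chunks[d.argmax()])
    centers = np.stack(centers).astype(float)
    for _ in range(iters):
        # Assign every chunk to its nearest center, then re-average.
        labels = ((chunks[:, None] - centers[None]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = chunks[labels == j].mean(0)
    labels = ((chunks[:, None] - centers[None]) ** 2).sum(-1).argmin(1)
    return centers, labels

def generate(centers, labels, chunks, rng):
    """Two-stage generation: choose one mode, then add a fine residual."""
    counts = np.bincount(labels, minlength=len(centers))
    mode = int(counts.argmax())  # greedy stand-in for the primary policy
    resid_std = (chunks[labels == mode] - centers[mode]).std(0)
    # Gaussian residual: stand-in for the mode-conditioned fine policy.
    return centers[mode] + rng.normal(0.0, resid_std)

# Synthetic demonstrations with two valid modes (e.g. pass left or right).
rng = np.random.default_rng(0)
H = 4  # chunk horizon, hypothetical
left = -1.0 + 0.1 * rng.standard_normal((200, H))
right = 1.0 + 0.1 * rng.standard_normal((200, H))
chunks = np.concatenate([left, right])

centers, labels = fit_modes(chunks, k=2)

# Single-stage mean regression averages across modes.
mse_single = ((chunks - chunks.mean(0)) ** 2).mean()
# Mode-conditioned prediction, assuming the correct mode is selected.
mse_two_stage = ((chunks - centers[labels]) ** 2).mean()

def nearest_demo_dist(a):
    """Euclidean distance from chunk a to the closest demonstration."""
    return np.sqrt(((chunks - a) ** 2).sum(-1)).min()

dist_single = nearest_demo_dist(chunks.mean(0))
dist_two_stage = nearest_demo_dist(generate(centers, labels, chunks, rng))

print(f"MSE single-stage mean: {mse_single:.3f}")
print(f"MSE mode-conditioned:  {mse_two_stage:.3f}")
print(f"nearest-demo distance: {dist_single:.3f} vs {dist_two_stage:.3f}")
```

In this toy setting the averaged single-stage prediction falls between the two modes and resembles no demonstration. Committing to one mode first keeps the generated chunk close to demonstrated behavior.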

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 4 · Confidence 3

Strengths

- The writing is good and the paper is easy to follow.
- The two-stage design that separates coarse mode selection from fine residual generation is clear and practical. The one-step MeanFlow decoder offers compelling speed advantages while maintaining accuracy.
- Empirical results are broad and consistently favorable across many simulated tasks, and the real-world demos indicate potential practical relevance. The ablations on K and tokenization provide preliminary guidance on design choices.

Weaknesses

- Greedy mode selection without temporal smoothing can still allow mode bouncing, and it relies heavily on a carefully chosen number-of-modes hyperparameter.
- Several design choices (VQ-VAE, MeanFlow) seem to provide only marginal improvement according to the ablation studies.
- Clustering-based coarse-to-fine mode selection has been explored in other robotics domains, such as motion forecasting in autonomous driving, so the novelty of this paper is somewhat limited.

Reviewer 02 · Rating 6 · Confidence 3

Strengths

The paper tackles a key challenge in imitation learning, modeling multi-modal action distributions, with a clear two-stage formulation. The Primary-Fine Decoupling design (discrete mode selection + continuous residual generation) is simple yet effective, and leads to more stable, temporally consistent control compared to other diffusion or flow-based baselines. The paper is well-supported by both theoretical justification and empirical evidence, including a variance-based MSE bound analysis, a

Weaknesses

While the paper attributes instability to "mode bouncing", it does not empirically compare against simpler baselines that include observation or action history. In practice, conditioning on temporal history or predicting waypoints (as done in prior works) can also effectively reduce mode switching. Also, some other related approaches such as "Behavior Generation with Latent Actions" (ICML 2024) and "Hierarchical Diffusion Policy: Manipulation Trajectory Generation via Contact Guidance" (T-RO) sh

Reviewer 03 · Rating 4 · Confidence 4

Strengths

The proposed approach shows promising results on a variety of tasks both in simulation and in the real world, not just in terms of accuracy but also in capturing the multimodality effectively. The hierarchical disambiguation helps prevent "mode bouncing", which allows for conditional generation of the fine-grained actions. The use of MeanFlow instead of Flow Matching also shows a good improvement, which adds some novelty. The paper is well written and easy to follow.

Weaknesses

While the proposed approach shows promise, this approach of hierarchical policy structures for robot learning has been well studied in previous approaches, yet there is no mention or comparison against previous approaches. Only some discussion on Imitation Learning and Action space discretization is presented. There is no comparison against any hierarchical approach, but rather against standard end-to-end visuomotor policy learning approaches like using Diffusion or Flow Matching. Many previous

Taxonomy

Topics: Robot Manipulation and Learning · Reinforcement Learning in Robotics · Generative Adversarial Networks and Image Synthesis