When would Vision-Proprioception Policies Fail in Robotic Manipulation?

Jingxian Lu; Wenke Xia; Yuxuan Wu; Zhiwu Lu; Di Hu

arXiv:2602.12032·cs.RO·February 13, 2026

Open Access · 3 Reviews

TL;DR

This paper investigates the limitations of vision-proprioception policies in robotic manipulation, revealing that proprioception dominates during motion transitions and proposing a gradient adjustment method to improve policy generalization.

Contribution

The paper introduces the GAP algorithm that adaptively modulates proprioception during training, enhancing the robustness and generalization of vision-proprioception policies.

Findings

01. GAP improves policy robustness in simulated and real-world tasks.

02. Proprioception dominates during motion transitions, limiting visual learning.

03. The method is effective across various robotic setups and models.

Abstract

Proprioceptive information is critical for precise servo control because it provides real-time robot states. Its combination with vision is widely expected to enhance the performance of manipulation policies in complex tasks. However, recent studies have reported inconsistent observations on the generalization of vision-proprioception policies. In this work, we investigate this by conducting temporally controlled experiments. We find that during task sub-phases in which the robot's motion transitions, which require target localization, the vision modality of the vision-proprioception policy plays a limited role. Further analysis reveals that the policy naturally gravitates toward concise proprioceptive signals that offer faster loss reduction during training; these signals thereby dominate the optimization and suppress the learning of the visual modality during motion-transition phases. To alleviate this, we…

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 8 · Confidence 3

Strengths

1. The paper tackles an important and practical challenge in robot learning: how to effectively combine vision and proprioception for efficient and accurate manipulation.
2. The empirical findings are well-motivated, and the proposed gradient adjustment method, GAP, is simple, interpretable, and effective.
3. Both simulation and real-world experiments demonstrate the effectiveness of GAP. The approach shows consistent gains over both vision-only and vision-proprioception baselines, and the impro

Weaknesses

Overall, this is a good paper for me, but there are a few minor concerns:

1. The paper lacks a deeper investigation into the form of the gradient adjustment function $(1-\rho)$ and the choice of key hyperparameters. Since the paper mentions that gradient adjustment is applied only during the early stage of training, it would be beneficial to discuss the rationale for choosing specific hyperparameters ($\alpha$, $\beta$, and the number of stages for applying GAP) and to include an ablation study on

Reviewer 02 · Rating 6 · Confidence 2

Strengths

- The key idea is that during optimization, proprio-encoder parameters are updated as $\omega_s^{j+1} \leftarrow \omega_s^j - \lambda (1 - \rho)\eta \nabla_{\omega_s^j} \mathcal{L}_{\text{BC}}$, where $\lambda$ controls scaling, $\eta$ is the learning rate, and $\rho \in [0,1]$ measures transition likelihood. A higher $\rho$ (likely transition) down-scales proprio gradients, forcing the visual encoder to learn those steps more effectively. GAP improves success rates across simulated and real tas
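As a rough illustration of the update rule quoted above, the following minimal sketch applies the scaled step to a toy parameter vector. It is an example under stated assumptions, not the authors' implementation: the function name `gap_scaled_step`, the plain-list representation of weights and gradients, and the default values of `lam` and `lr` are all hypothetical, and $\rho$ is passed in directly rather than estimated by a phase detector.

```python
def gap_scaled_step(weights, grads, rho, lam=1.0, lr=0.1):
    """One gradient step on proprio-encoder weights, scaled by lam * (1 - rho).

    Hypothetical helper: weights and grads are plain lists of floats, and
    rho is a transition likelihood in [0, 1] supplied by the caller.
    """
    if not 0.0 <= rho <= 1.0:
        raise ValueError("rho must lie in [0, 1]")
    if len(weights) != len(grads):
        raise ValueError("weights and grads must have the same length")
    # Higher rho (likely motion transition) shrinks the proprioceptive update.
    step = lam * (1.0 - rho) * lr
    return [w - step * g for w, g in zip(weights, grads)]


# Toy values chosen only to exercise the function.
weights = [1.0, -2.0]
grads = [0.5, 0.5]
stable = gap_scaled_step(weights, grads, rho=0.0)      # full-size step
transition = gap_scaled_step(weights, grads, rho=1.0)  # update fully suppressed
```

With `rho=1.0` the weights are left unchanged, which matches the description that a likely transition down-scales the proprioceptive gradients. In an autograd framework the same factor could presumably be applied to the proprio-encoder gradients before the optimizer step.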

Weaknesses

- Hyperparameters ($\alpha$, $\beta$, $\lambda$) and LSTM size require task-specific tuning, which is a limitation of this work.
- From my understanding, CPD/LSTM finds proprio change points, not visual evidence of new targets. Showing that $\rho_t$ tracks visual uncertainty (e.g., entropy over detectors) would strengthen the claim.
- One minor concern is that the current phase-detection method and the gradient-adjustment strategy both rely on proprioceptive signals (joint positions, velocities,
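The check suggested in the second point could be sketched roughly as follows. This is a hypothetical illustration, not an analysis from the paper: it assumes a detector that outputs a categorical distribution at each time step, uses Shannon entropy as the uncertainty measure, and computes a Pearson correlation against $\rho_t$. The function names and the toy numbers are invented purely to exercise the code and are not results.

```python
import math


def shannon_entropy(probs):
    """Shannon entropy (in nats) of a categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Toy, made-up inputs: per-step detector distributions and transition
# likelihoods rho_t. Real values would come from a detector and from GAP.
detector_probs = [[0.90, 0.05, 0.05], [0.40, 0.30, 0.30], [0.34, 0.33, 0.33]]
rho_t = [0.1, 0.6, 0.9]

entropies = [shannon_entropy(p) for p in detector_probs]
correlation = pearson(entropies, rho_t)
```

A strongly positive `correlation` on real rollouts would be one piece of evidence that $\rho_t$ rises when the visual target is uncertain; other uncertainty measures could be substituted for entropy.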

Reviewer 03 · Rating 6 · Confidence 5

Strengths

1. A detailed analysis was conducted to investigate the reasons behind the suppressed learning of the visual modality.
2. Phase-guided gradient adjustment offers a principled approach to dynamic modality balancing.
3. Comprehensive evaluation across different environments, platforms, and models.

Weaknesses

The assessment lacks a direct evaluation of whether the visual modality is better utilized. While performance metrics provide some evidence, qualitative visualizations or an explainability analysis of the GAP-trained model would more convincingly demonstrate the method's effectiveness to readers.

Taxonomy

Topics: Robot Manipulation and Learning · Motor Control and Adaptation · Teleoperation and Haptic Systems