When would Vision-Proprioception Policies Fail in Robotic Manipulation?
Jingxian Lu, Wenke Xia, Yuxuan Wu, Zhiwu Lu, Di Hu

TL;DR
This paper investigates the limitations of vision-proprioception policies in robotic manipulation, revealing that proprioception dominates during motion transitions and proposing a gradient adjustment method to improve policy generalization.
Contribution
The paper introduces the GAP algorithm, which adaptively modulates the proprioception gradient during training to improve the robustness and generalization of vision-proprioception policies.
Findings
GAP improves policy robustness in simulated and real-world tasks.
Proprioception dominates during motion transitions, limiting visual learning.
The method is effective across various robotic setups and models.
Abstract
Proprioceptive information is critical for precise servo control because it provides real-time robot states. Combining it with vision is widely expected to improve the performance of manipulation policies in complex tasks. However, recent studies have reported inconsistent observations on the generalization of vision-proprioception policies. In this work, we investigate this by conducting temporally controlled experiments. We found that during task sub-phases in which the robot's motion transitions, which require target localization, the vision modality of the vision-proprioception policy plays a limited role. Further analysis reveals that the policy naturally gravitates toward concise proprioceptive signals that offer faster loss reduction during training; these signals dominate the optimization and suppress the learning of the visual modality during motion-transition phases. To alleviate this, we…
Peer Reviews
Decision: ICLR 2026 Poster
1. The paper tackles an important and practical challenge in robot learning—how to effectively combine vision and proprioception for efficient and accurate manipulation. 2. The empirical findings are well-motivated, and the proposed gradient adjustment method, GAP, is simple, interpretable, and effective. 3. Both simulation and real-world experiments demonstrate the effectiveness of GAP. The approach shows consistent gains over both vision-only and vision-proprioception baselines, and the impro
Overall, this is a good paper for me, but there are a few minor concerns: 1. The paper lacks a deeper investigation into the form of the gradient adjustment function $(1-\rho)$ and the choice of key hyperparameters. Since the paper mentions that gradient adjustment is applied only during the early stage of training, it would be beneficial to discuss the rationale for choosing specific hyperparameters ($\alpha$, $\beta$, and the number of stages for applying GAP) and to include an ablation study on
- The key idea is that during optimization, proprio-encoder parameters are updated as $\omega_s^{j+1} \leftarrow \omega_s^j - \lambda (1 - \rho)\eta \nabla_{\omega_s^j} \mathcal{L}_{\text{BC}}$, where $\lambda$ controls scaling, $\eta$ is the learning rate, and $\rho \in [0,1]$ measures transition likelihood. A higher $\rho$ (likely transition) down-scales proprio gradients, forcing the visual encoder to learn those steps more effectively. GAP improves success rates across simulated and real tas
- Hyperparameters ($\alpha$, $\beta$, $\lambda$) and LSTM size require task-specific tuning, which is a limitation of this work.
- From my understanding, CPD/LSTM finds proprio change points, not visual evidence of new targets. Showing that $\rho_t$ tracks visual uncertainty (e.g., entropy over detectors) would strengthen the claim.
- One minor concern is that the current phase-detection method and the gradient-adjustment strategy both rely on proprioceptive signals (joint positions, velocities,
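The update rule quoted in the review above, $\omega_s^{j+1} \leftarrow \omega_s^j - \lambda (1 - \rho)\eta \nabla_{\omega_s^j} \mathcal{L}_{\text{BC}}$, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `gap_scale` and `gap_update` are hypothetical, the parameter and gradient values are toy numbers, and how $\rho$ is estimated (the reviews mention CPD/LSTM) is not modeled here. The sketch only shows how the factor $\lambda(1-\rho)$ shrinks the proprioception-encoder step as the transition likelihood grows.

```python
from typing import List


def gap_scale(rho: float, lam: float) -> float:
    """Multiplier lam * (1 - rho) applied to the proprio-encoder gradient.

    rho is the transition likelihood in [0, 1]; lam is a scaling factor.
    Both names follow the symbols in the review, not any released code.
    """
    if not 0.0 <= rho <= 1.0:
        raise ValueError("rho must lie in [0, 1]")
    return lam * (1.0 - rho)


def gap_update(omega_s: List[float], grad: List[float],
               rho: float, lam: float, eta: float) -> List[float]:
    """One step: omega_s <- omega_s - lam * (1 - rho) * eta * grad."""
    scale = gap_scale(rho, lam)
    return [w - scale * eta * g for w, g in zip(omega_s, grad)]


# Toy values (assumptions for illustration only).
omega_s = [1.0, -2.0]   # proprio-encoder parameters
grad = [0.5, -1.0]      # gradient of the BC loss w.r.t. omega_s

for rho in (0.0, 0.5, 1.0):
    updated = gap_update(omega_s, grad, rho=rho, lam=1.0, eta=0.1)
    print(f"rho={rho}: {updated}")
```

With `rho=0.0` the step is an ordinary gradient-descent update; with `rho=1.0` the proprioception encoder is left unchanged for that sample, so only the other parts of the policy (for example the visual encoder) receive a learning signal from it.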
1. A detailed analysis was conducted to investigate the reasons behind the suppressed learning of the visual modality; 2. Phase-guided gradient adjustment offers a principled approach to dynamic modality balancing; 3. Comprehensive evaluation across different environments, platforms, and models.
The assessment lacks a direct evaluation of whether the visual modality is better utilized. While performance metrics provide some evidence, incorporating qualitative visualizations or explainable analysis of the GAP-trained model would more effectively demonstrate the method's effectiveness to readers.
Taxonomy
Topics: Robot Manipulation and Learning · Motor Control and Adaptation · Teleoperation and Haptic Systems
