DyGRO-VLA: Cross-Task Scaling of Vision-Language-Action Models via Dynamic Grouped Residual Optimization

Sixu Lin; Yunpeng Qing; Litao Liu; Ming Zhou; Ruixing Jin; Xiaoyi Fan; Guiliang Liu

arXiv:2605.17486·cs.RO·May 19, 2026

TL;DR

DyGRO-VLA is a two-stage optimization framework that improves how well vision-language-action models generalize across tasks: it first captures cross-task latent representations, then refines policy optimization with a mixture of RL residuals.

Contribution

The paper proposes DyGRO-VLA, a new method for multi-task RL optimization that improves cross-task feature learning and mitigates interference, leading to better generalization.

Findings

01. Consistent improvements on the LIBERO and RoboTwin2 benchmarks.

02. Effective capture of cross-task latent representations.

03. Robust performance under distribution shifts.

Abstract

Recent progress in Reinforcement Learning (RL) provides a principled approach to optimizing Vision-Language-Action (VLA) models, facilitating a shift from trajectory imitation to active learning in the task environment. Despite improvements in control precision, most RL optimizers remain task-specific, which reduces VLA models from generalist controllers to policies that overfit to a narrow set of tasks. In this study, we conduct an in-depth analysis of this phenomenon and highlight the importance of cross-task feature representations for improving the generalizability of VLA models. Motivated by this finding, we introduce DyGRO-VLA, a two-stage optimization framework that 1) effectively captures cross-task latent representations based on information-theoretic principles, and 2) dynamically refines policy optimization via a mixture-of-RL-residuals. DyGRO-VLA enables the RL optimizer to…
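The abstract names a "mixture-of-RL-residuals" but the excerpt here does not say how it is built. As a rough illustration only, the sketch below shows one generic way a gated mixture of residual corrections could be added to a base policy's action. The function names (`softmax`, `mix_residuals`), the softmax gate, and the additive form are all assumptions made for this example, not the authors' design.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of gate logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mix_residuals(
    base_action: Sequence[float],
    residuals: Sequence[Sequence[float]],
    gate_logits: Sequence[float],
) -> List[float]:
    """Hypothetical mixture of residuals (an assumption, not the paper's method).

    Returns base_action + sum_k w_k * residuals[k], where
    w = softmax(gate_logits) weights each residual head.
    """
    if len(residuals) != len(gate_logits):
        raise ValueError("need exactly one gate logit per residual head")
    weights = softmax(gate_logits)
    mixed = list(base_action)
    for w, residual in zip(weights, residuals):
        if len(residual) != len(base_action):
            raise ValueError("residual and base action dimensions differ")
        for i, r_i in enumerate(residual):
            mixed[i] += w * r_i
    return mixed


# Two residual heads with equal gate logits, so each gets weight 0.5.
action = mix_residuals(
    base_action=[0.5, -0.2],
    residuals=[[0.1, 0.0], [-0.1, 0.2]],
    gate_logits=[0.0, 0.0],
)
print(action)  # approximately [0.5, -0.1], up to floating-point rounding
```

In this toy form the base action stays untouched when the residuals cancel, and the gate decides how much each residual head contributes. How DyGRO-VLA actually computes its gates and residuals would have to be taken from the full paper.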
