TL;DR
This paper introduces GridS, a differentiable, geometry-aware token resampling method that cuts FLOPs in vision-language-action (VLA) models by over 76% with no loss in success rate.
Contribution
The paper proposes GridS, a continuous token resampling module that preserves critical spatial information, such as contact points, while enabling aggressive compression of visual tokens in VLA models.
Findings
Achieves a reduction in FLOPs of over 76% with no loss in success rate.
Preserves essential geometric details with fewer than 10% of the original tokens.
Demonstrates effectiveness on the LIBERO benchmark and on a real robotic platform.
Abstract
Vision-Language-Action (VLA) models have shown remarkable promise in robotic manipulation, yet their high computational cost hinders real-time deployment. Existing token pruning methods suffer from a fundamental trade-off: aggressive compression through pruning inevitably discards critical geometric details such as contact points, leading to severe performance degradation. This forces a compromise that limits the achievable compression rate and thus the potential speedup. We argue that breaking this trade-off requires rethinking compression as geometry-aware, continuous token resampling in the vision encoder. To this end, we propose the Differentiable Grid Sampler (GridS), a plug-and-play module that performs task-aware, continuous resampling of visual tokens in VLA models. By adaptively predicting a minimal set of salient coordinates and extracting features via differentiable interpolation, GridS…
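To make "extracting features via differentiable interpolation" concrete, here is a minimal sketch of bilinear sampling at continuous coordinates on a patch-feature grid. It is not the authors' implementation: the function names `bilinear_sample` and `resample_tokens`, the toy 4 x 4 grid, and the hand-picked coordinates are all assumptions for illustration, and the coordinate predictor described in the abstract is not modeled. The sketch uses plain Python so the arithmetic is visible; an autodiff framework would be needed to actually backpropagate through the sampled coordinates.

```python
import math


def bilinear_sample(feature_map, x, y):
    """Sample a C-dim feature at continuous (x, y) from an H x W x C nested list.

    x indexes columns and y indexes rows, both in grid units (hypothetical convention).
    """
    height, width = len(feature_map), len(feature_map[0])
    # Clamp so the four neighbouring cells always lie inside the grid.
    x = min(max(x, 0.0), width - 1.0)
    y = min(max(y, 0.0), height - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)
    wx, wy = x - x0, y - y0  # fractional offsets act as interpolation weights
    channels = len(feature_map[0][0])
    return [
        (1 - wy) * ((1 - wx) * feature_map[y0][x0][c] + wx * feature_map[y0][x1][c])
        + wy * ((1 - wx) * feature_map[y1][x0][c] + wx * feature_map[y1][x1][c])
        for c in range(channels)
    ]


def resample_tokens(feature_map, coords):
    """Return one interpolated token per (x, y) coordinate."""
    return [bilinear_sample(feature_map, x, y) for x, y in coords]


# Toy grid (assumption): the feature at (row, col) is [col, row], so an
# interpolated feature should equal the query coordinate itself.
grid = [[[float(col), float(row)] for col in range(4)] for row in range(4)]

# Hand-picked stand-ins for the "salient coordinates" a predictor would output.
coords = [(0.5, 0.5), (2.25, 1.0), (3.0, 3.0)]
tokens = resample_tokens(grid, coords)

print(len(grid) * len(grid[0]), "grid tokens ->", len(tokens), "resampled tokens")
```

Because the output is a weighted sum whose weights vary continuously with `x` and `y`, small shifts in a coordinate produce small changes in the sampled token. That continuity is what lets gradients flow back to whatever module predicts the coordinates, in contrast to hard pruning, which keeps or drops whole tokens.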
