TL;DR
AttenA+ reweights the training of robotic foundation models toward kinematically critical, low-velocity action segments, aligning optimization with the physical demands of manipulation and improving performance on benchmarks and real-world tasks.
Contribution
Introduces AttenA+, an architecture-agnostic, velocity-driven attention framework that shifts training emphasis toward critical action segments without structural changes to the model.
Findings
Improves Libero benchmark accuracy to 98.6%.
Enhances RoboTwin 2.0 performance to 92.4%.
Demonstrates robustness on real-world robotic tasks.
Abstract
Existing robotic foundation models, while powerful, are predicated on an implicit assumption of temporal homogeneity: treating all actions as equally informative during optimization. This "flat" training paradigm, inherited from language modeling, remains indifferent to the underlying physical hierarchy of manipulation. In reality, robot trajectories are fundamentally heterogeneous, where low-velocity segments often dictate task success through precision-demanding interactions, while high-velocity motions serve as error-tolerant transitions. Such a misalignment between uniform loss weighting and physical criticality fundamentally limits the performance of current Vision-Language-Action (VLA) models and World-Action Models (WAM) in complex, long-horizon tasks. To rectify this, we introduce AttenA+, an architecture-agnostic framework that prioritizes kinematically critical segments via…
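The abstract excerpt is cut off before it says how AttenA+ computes its weighting, so the following is only a minimal sketch of the general idea it describes: replacing a uniform per-timestep loss with one that emphasizes low-velocity steps. The function names `velocity_weights` and `weighted_mse`, the inverse-speed formula, and the `alpha` and `dt` parameters are all assumptions made for illustration, not the paper's method.

```python
import numpy as np

def velocity_weights(actions, dt=1.0, alpha=1.0, eps=1e-8):
    """Hypothetical per-timestep loss weights that grow as speed drops.

    actions: array of shape (T, D) with T >= 2, e.g. end-effector positions.
    Returns an array of shape (T,) whose mean is 1.
    """
    actions = np.asarray(actions, dtype=float)
    # Finite-difference velocity between consecutive steps: shape (T-1, D).
    diffs = np.diff(actions, axis=0) / dt
    speed = np.linalg.norm(diffs, axis=1)
    # Reuse the first difference for step 0 so there is one speed per step.
    speed = np.concatenate([speed[:1], speed])
    # Assumed inverse-speed rule: slow (precision) steps get larger weights.
    raw = 1.0 / (1.0 + alpha * speed / (speed.mean() + eps))
    # Normalize to mean 1 so the overall loss scale matches uniform weighting.
    return raw / raw.mean()

def weighted_mse(pred, target, weights):
    """Mean squared error per timestep, averaged with the given weights."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    per_step = ((pred - target) ** 2).mean(axis=1)
    return float((weights * per_step).mean())

# Toy 1-D trajectory: a fast transit followed by a slow, careful approach.
traj = np.array([[0.0], [1.0], [2.0], [2.1], [2.2]])
w = velocity_weights(traj)
loss = weighted_mse(traj + 0.05, traj, w)
```

In this toy trajectory the last two steps move slowly and therefore receive larger weights than the first three, so an error made during the slow approach costs more than the same error made during the fast transit. Setting `alpha=0` recovers uniform weighting, which corresponds to the "flat" paradigm the abstract criticizes.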
