TL;DR
FrameSkip selectively samples high-importance frames from robot demonstration trajectories to improve vision-language-action policy training efficiency and success rates.
Contribution
Introduces a data-layer frame selection method that concentrates training on manipulation-critical frames without altering the model architecture, training objective, or inference procedure.
Findings
Achieves a 76.15% success rate across benchmarks with only 20% of frames retained.
Outperforms full-frame training and simpler frame selection methods.
Improves the trade-off between success rate and frame retention in VLA policy training.
Abstract
Vision-Language-Action (VLA) policies are commonly trained from dense robot demonstration trajectories, often collected through teleoperation, by sampling every recorded frame as if it provided equally useful supervision. We argue that this convention creates a temporal supervision imbalance: long low-change segments dominate the training stream, while manipulation-critical transitions such as alignment, contact, grasping, and release appear only sparsely. We introduce FrameSkip, a data-layer frame selection framework that scores trajectory frames using action variation, visual-action coherence, task-progress priors, and gripper-transition preservation, then remaps training samples toward high-importance frames under a target retention ratio. Because FrameSkip operates only in the dataloader, it leaves the VLA architecture, action head, training objective, and inference procedure…
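To make the scoring-and-selection idea concrete, here is a minimal sketch, not the authors' implementation. It uses only two of the four signals named in the abstract (action variation and gripper-transition preservation) and omits visual-action coherence and task-progress priors. The function name `select_frames`, its parameters, and the rule of ranking gripper transitions ahead of all other frames are assumptions made for illustration.

```python
import numpy as np


def select_frames(actions, gripper, retention=0.2):
    """Return sorted indices of frames to keep (hypothetical helper).

    actions:   (T, D) array of per-frame action vectors.
    gripper:   (T,) array of discrete gripper states (e.g. 0 = open, 1 = closed).
    retention: target fraction of frames to keep.
    """
    actions = np.asarray(actions, dtype=float)
    gripper = np.asarray(gripper)
    num_frames = len(actions)
    budget = max(1, int(round(retention * num_frames)))

    # Action variation: L2 norm of the change from the previous frame.
    # The first frame has no predecessor, so its score is 0.
    delta = np.linalg.norm(np.diff(actions, axis=0), axis=1)
    score = np.concatenate([[0.0], delta])

    # Gripper-transition preservation (assumed rule): frames where the
    # gripper state changes are ranked ahead of everything else.
    transitions = np.flatnonzero(np.diff(gripper) != 0) + 1
    score[transitions] = np.inf

    # Keep the highest-scoring frames within the budget, in temporal order.
    order = np.argsort(-score, kind="stable")
    return np.sort(order[:budget])
```

A dataloader could then draw training samples only from the returned indices, which is one simple way to realize a target retention ratio without touching the model or the loss.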
