TL;DR
The paper introduces TTF-VLA, a novel temporal token fusion method that enhances vision-language-action models by integrating historical visual data, leading to improved performance and robustness in robotic manipulation tasks.
Contribution
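The paper presents a training-free, selective temporal token fusion approach that combines grayscale pixel-difference detection with attention-based relevance to decide where historical and current visual tokens are fused, improving VLA performance on LIBERO, SimplerEnv, and real robot tasks.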
Findings
Achieved a 4.0% average improvement on the LIBERO benchmark
Demonstrated a 4.8% relative improvement on SimplerEnv
Achieved an 8.7% relative improvement on real robot tasks
Abstract
Vision-Language-Action (VLA) models process visual inputs independently at each timestep, discarding valuable temporal information inherent in robotic manipulation tasks. This frame-by-frame processing makes models vulnerable to visual noise while ignoring the substantial coherence between consecutive frames in manipulation sequences. We propose Temporal Token Fusion (TTF), a training-free approach that intelligently integrates historical and current visual representations to enhance VLA inference quality. Our method employs dual-dimension detection combining efficient grayscale pixel difference analysis with attention-based semantic relevance assessment, enabling selective temporal token fusion through hard fusion strategies and keyframe anchoring to prevent error accumulation. Comprehensive experiments across LIBERO, SimplerEnv, and real robot tasks demonstrate consistent…
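The selection logic described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The function names, the thresholds, the keyframe interval, and the rule that a patch takes its current token when either detector fires are all assumptions made for illustration. "Hard fusion" is read here as choosing each token entirely from the current frame or entirely from history, with no blending.

```python
from typing import List, Optional, Sequence

Token = List[float]  # one visual token (patch embedding); hypothetical representation


def patch_pixel_diff(prev_gray: Sequence[Sequence[float]],
                     curr_gray: Sequence[Sequence[float]],
                     patch: int) -> List[float]:
    """Mean absolute grayscale difference per patch, in row-major patch order.

    Assumes both frames have the same size and that height and width
    are multiples of `patch`.
    """
    height, width = len(curr_gray), len(curr_gray[0])
    diffs = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            total = 0.0
            for y in range(top, top + patch):
                for x in range(left, left + patch):
                    total += abs(curr_gray[y][x] - prev_gray[y][x])
            diffs.append(total / (patch * patch))
    return diffs


def fuse_tokens(prev_tokens: Optional[List[Token]],
                curr_tokens: List[Token],
                pixel_diff: List[float],
                attn_relevance: List[float],
                step: int,
                diff_thresh: float = 5.0,      # assumed value, not from the paper
                attn_thresh: float = 0.5,      # assumed value, not from the paper
                keyframe_interval: int = 3) -> List[Token]:  # assumed value
    """Select, per patch, either the current token or the historical one."""
    # Keyframe anchoring: periodically (and on the first frame) use only
    # current tokens, so stale history cannot accumulate indefinitely.
    if prev_tokens is None or step % keyframe_interval == 0:
        return list(curr_tokens)

    fused = []
    for prev, curr, diff, attn in zip(prev_tokens, curr_tokens,
                                      pixel_diff, attn_relevance):
        changed = diff > diff_thresh     # pixel dimension: the patch moved or changed
        relevant = attn > attn_thresh    # semantic dimension: the patch matters to the task
        # Hard fusion: no interpolation, pick one source per token.
        fused.append(curr if (changed or relevant) else prev)
    return fused


# Toy usage: a 4x4 frame split into four 2x2 patches.
prev_frame = [[0.0] * 4 for _ in range(4)]
curr_frame = [[10.0, 10.0, 0.0, 0.0],
              [10.0, 10.0, 0.0, 0.0],
              [0.0, 0.0, 0.0, 0.0],
              [0.0, 0.0, 0.0, 0.0]]
diffs = patch_pixel_diff(prev_frame, curr_frame, patch=2)
attn = [0.0, 0.9, 0.0, 0.0]                 # hypothetical attention scores
prev_toks = [[0.0], [0.1], [0.2], [0.3]]
curr_toks = [[1.0], [1.1], [1.2], [1.3]]
fused = fuse_tokens(prev_toks, curr_toks, diffs, attn, step=1)
```

In this toy run the first patch is refreshed because its pixels changed, the second because its assumed attention score is high, and the remaining two reuse the historical tokens. At any step that is a multiple of `keyframe_interval`, every token comes from the current frame.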