TTF-VLA: Temporal Token Fusion via Pixel-Attention Integration for Vision-Language-Action Models

Chenghao Liu; Jiachen Zhang; Chengxuan Li; Zhimu Zhou; Shixin Wu; Songfang Huang; Huiling Duan

arXiv:2508.19257·cs.CV·November 17, 2025


TL;DR

The paper introduces TTF-VLA, a novel temporal token fusion method that enhances vision-language-action models by integrating historical visual data, leading to improved performance and robustness in robotic manipulation tasks.

Contribution

The paper presents a training-free, selective temporal token fusion approach that leverages grayscale pixel difference and attention-based relevance, improving VLA model performance on LIBERO, SimplerEnv, and real robot tasks.

Findings

01. Achieved 4.0% average improvement on LIBERO dataset

02. Demonstrated 4.8% relative improvement on SimplerEnv

03. Real robot tasks saw an 8.7% relative performance boost

Abstract

Vision-Language-Action (VLA) models process visual inputs independently at each timestep, discarding valuable temporal information inherent in robotic manipulation tasks. This frame-by-frame processing makes models vulnerable to visual noise while ignoring the substantial coherence between consecutive frames in manipulation sequences. We propose Temporal Token Fusion (TTF), a training-free approach that intelligently integrates historical and current visual representations to enhance VLA inference quality. Our method employs dual-dimension detection combining efficient grayscale pixel difference analysis with attention-based semantic relevance assessment, enabling selective temporal token fusion through hard fusion strategies and keyframe anchoring to prevent error accumulation. Comprehensive experiments across LIBERO, SimplerEnv, and real robot tasks demonstrate consistent…
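
The selection logic the abstract describes (pixel difference plus attention relevance, hard fusion, keyframe anchoring) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the thresholds, the top-k relevance rule, the plain channel-average grayscale conversion, and the fixed keyframe interval are all assumptions chosen for illustration. The sketch also assumes hard fusion means that each patch token is taken either entirely from the current frame or entirely from the cached history.

```python
import numpy as np

def patch_pixel_change(prev_frame, curr_frame, patch):
    """Mean absolute grayscale difference per patch, flattened row-major.

    Assumes (H, W, 3) frames with H and W divisible by `patch`.
    Grayscale here is a plain channel average (an illustrative choice).
    """
    prev_gray = prev_frame.astype(np.float32).mean(axis=-1)
    curr_gray = curr_frame.astype(np.float32).mean(axis=-1)
    diff = np.abs(curr_gray - prev_gray)
    h, w = diff.shape
    # Split the image into (h/patch, w/patch) tiles and average inside each tile.
    tiles = diff.reshape(h // patch, patch, w // patch, patch)
    return tiles.mean(axis=(1, 3)).reshape(-1)

def fuse_tokens(prev_tokens, curr_tokens, pixel_change, attn_scores, step,
                pixel_thresh=5.0, attn_top_k=2, keyframe_interval=4):
    """Hypothetical hard fusion of cached and current patch tokens.

    All parameter names and default values are assumptions, not from the paper.
    Assumes attn_top_k >= 1.
    """
    # Keyframe anchoring: periodically discard history so errors cannot accumulate.
    if prev_tokens is None or step % keyframe_interval == 0:
        return curr_tokens.copy()
    # Dimension 1: patches whose pixels changed noticeably.
    changed = pixel_change > pixel_thresh
    # Dimension 2: patches the model attends to most (top-k by attention score).
    relevant = np.zeros_like(changed)
    relevant[np.argsort(attn_scores)[-attn_top_k:]] = True
    # Hard fusion: refreshed patches use current tokens, the rest reuse history.
    refresh = changed | relevant
    return np.where(refresh[:, None], curr_tokens, prev_tokens)
```

In this sketch a patch is refreshed if either signal fires, so a static but task-relevant region still receives current tokens, while static and irrelevant regions reuse the cached ones. How the two signals are actually combined and thresholded in TTF is not stated in the text above.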

