DTP: A Simple yet Effective Distracting Token Pruning Framework for Vision-Language Action Models

Chenyang Li; Jieyuan Liu; Bin Li; Bo Gao; Yilin Yuan; Yangfan He; Yuchen Li; Jingqun Tang

arXiv:2601.16065·cs.CV·January 23, 2026

Open Access · 3 Reviews

TL;DR

This paper introduces DTP, a plug-and-play framework that prunes distracting image tokens in vision-language action models to improve task success rates without altering the original architecture.

Contribution

The paper proposes a simple, effective token pruning method that enhances VLA model performance by dynamically removing irrelevant tokens, demonstrating generalizability across different models.

Findings

1. DTP improves task success rates across various VLA models.
2. A negative correlation exists between attention to irrelevant regions and task success.
3. The method is effective without changing the original model architecture.
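The second finding relates attention on irrelevant regions to task success. As a rough illustration of how such a quantity could be measured, the sketch below computes the share of attention mass that falls on tokens outside a task-relevant region. The function name, the per-token attention list, and the boolean relevance mask are all assumptions made for this example; the page does not say how the paper defines or measures this quantity.

```python
from typing import Sequence


def irrelevant_attention_share(
    attention: Sequence[float],
    relevant: Sequence[bool],
) -> float:
    """Fraction of total attention mass placed on task-irrelevant tokens.

    `attention` holds one non-negative weight per image token and
    `relevant` marks which tokens lie in the task-relevant region.
    Both inputs are hypothetical stand-ins, not the paper's definitions.
    """
    if len(attention) != len(relevant):
        raise ValueError("attention and relevant must have the same length")
    total = sum(attention)
    if total <= 0:
        # No attention mass at all: report zero rather than divide by zero.
        return 0.0
    off_region = sum(w for w, r in zip(attention, relevant) if not r)
    return off_region / total
```

Computed per episode, a value like this could then be compared against task outcomes to check for the kind of negative correlation the finding describes.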

Abstract

Vision-Language Action (VLA) models have shown remarkable progress in robotic manipulation by leveraging the powerful perception abilities of Vision-Language Models (VLMs) to understand environments and directly output actions. However, by default, VLA models may attend excessively to image tokens in task-irrelevant regions, which we call 'distracting tokens'. This behavior can divert the model from generating the desired action tokens at each step, lowering the task success rate. In this paper, we introduce a simple yet effective plug-and-play Distracting Token Pruning (DTP) framework, which dynamically detects and prunes these distracting image tokens. By correcting the model's visual attention patterns, we aim to improve the task success rate and to explore the performance upper bound of the model without altering its original architecture or adding…
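The detect-and-prune idea in the abstract can be illustrated with a minimal sketch. It assumes that each image token has a scalar attention weight, that a boolean mask marking the task-relevant region is already available, and that a token counts as distracting when it lies outside that region and its normalized attention exceeds a threshold `tau`. The function names and this exact rule are hypothetical; the page does not specify the paper's actual detection criterion or how its threshold is applied.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def distracting_keep_mask(
    attention: Sequence[float],
    relevant: Sequence[bool],
    tau: float,
) -> List[bool]:
    """Return a keep-mask over image tokens (True = keep, False = prune).

    Assumed rule, for illustration only: prune a token when it is outside
    the task-relevant region AND its share of total attention exceeds tau.
    """
    if len(attention) != len(relevant):
        raise ValueError("attention and relevant must have the same length")
    total = sum(attention)
    if total <= 0:
        # Nothing to normalize against, so keep every token.
        return [True] * len(attention)
    keep = []
    for weight, is_relevant in zip(attention, relevant):
        share = weight / total
        is_distracting = (not is_relevant) and share > tau
        keep.append(not is_distracting)
    return keep


def apply_keep_mask(tokens: Sequence[T], keep: Sequence[bool]) -> List[T]:
    """Drop the tokens whose keep flag is False, preserving order."""
    return [tok for tok, flag in zip(tokens, keep) if flag]
```

In this sketch the model itself is untouched: only the list of image tokens passed on is filtered, which mirrors the plug-and-play framing in the abstract. Irrelevant tokens with low attention are kept, since under the assumed rule they are not drawing the model's focus.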

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 4

Strengths

1. The paper presents a very interesting observation: the visual backbones of VLAs tend to attend to unimportant regions of the image, which leads to failure cases. It then proposes a very simple fix based on identifying important regions and thresholding the attention weights.
2. The paper presents results on multiple models, showing that the issue is prevalent and that the fix is general enough to work across these models.

Weaknesses

***1. Results only on SimplerEnv.*** I would have liked to see this study replicated on a larger set of environments and tasks. Note that most of the VLAs are not actually trained on SimplerEnv and are evaluated zero-shot. It would be interesting to see if this effect is observed in cases where the VLA is trained on the environment as well. Not saying doing evals on SimplerEnv is not valuable but it will be interesting to see the results on other benchmarks too! Similarly it would be also very

Reviewer 02 · Rating 4 · Confidence 3

Strengths

* **Simplicity and generality**: DTP is a simple, plug-and-play method that, assuming a standard VLA architecture, does not require architectural changes or additional inputs, making it broadly applicable to existing VLA models.
* **Interpretability**: other than slightly improving performance, the approach helps visualize and understand model attention, potentially aiding debugging and further research on semantic information within the VLA framework.

Weaknesses

* **Modest improvements**: the overall increase in success rate across the Simpler environments is about 4%. While the improvements seem to be consistent, the performance gap from optimality is still quite large and this approach does not seem to substantially advance the current state of VLAs.
* **Slower training/inference**: as stated by the authors, the method introduces a (small) overhead during training and inference. Given the small performance improvements, this hinders the adoptability o

Reviewer 03 · Rating 4 · Confidence 4

Strengths

- The paper proposes a principled way to extract important task-relevant vision tokens and utilizes this to prune tokens. This adds a new tool to VLA inference that allows boosting success rates without the need for finetuning.
- The method is tested on a good diversity of VLAs with various VLM backbones such as Paligemma 2, Qwen2.5VL and UniVLA, demonstrating the generality of the approach.
- An analysis of attention patterns is presented, which is valuable from an interpretability perspective

Weaknesses

- While the results on the SIMPLER benchmark are overall encouraging, relative success rate improvements on the Google robot tasks are minor, which makes the main result rely heavily on the WidowX robot tasks. Given the training-free nature of the approach, it would be important to test additional benchmarks with higher task diversity to confirm the efficacy of the method.
- Given the small number of tasks, the number of hyperparameters of the method seems high, including the threshold $\tau$,

Taxonomy

Topics: Multimodal Machine Learning Applications · Robot Manipulation and Learning · Social Robot Interaction and HRI