SwiftVLM: Efficient Vision-Language Model Inference via Cross-Layer Token Bypass

Chen Qian; Xinran Yu; Danyang Li; Guoxuan Chi; Zheng Yang; Qiang Ma; Xin Miao

arXiv:2602.03134 · cs.CV · February 4, 2026

Open Access

TL;DR

SwiftVLM introduces a layer-wise token bypass approach for vision-language models. Instead of committing to irreversible pruning decisions at early layers, it keeps unselected visual tokens available for later re-evaluation, which makes visual token pruning more accurate and efficient and improves performance on fine-grained tasks.

Contribution

The paper proposes a new pruning paradigm, bypass, and a method built on it, SwiftVLM. Both preserve unselected visual tokens and re-evaluate them at later layers, which makes pruning more flexible and more accurate.

Findings

1. Outperforms existing pruning strategies across multiple benchmarks.
2. Achieves better accuracy-efficiency trade-offs.
3. Demonstrates more faithful visual token selection behavior.

Abstract

Visual token pruning is a promising approach for reducing the computational cost of vision-language models (VLMs), and existing methods often rely on early pruning decisions to improve efficiency. While effective on coarse-grained reasoning tasks, they suffer from significant performance degradation on tasks requiring fine-grained visual details. Through layer-wise analysis, we reveal substantial discrepancies in visual token importance across layers, showing that tokens deemed unimportant at shallow layers can later become highly relevant for text-conditioned reasoning. To avoid irreversible critical information loss caused by premature pruning, we introduce a new pruning paradigm, termed bypass, which preserves unselected visual tokens and forwards them to subsequent pruning stages for re-evaluation. Building on this paradigm, we propose SwiftVLM, a simple and training-free method…
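The contrast the abstract draws between early irreversible pruning and the bypass paradigm can be illustrated with a small toy sketch. This is not the authors' implementation: the function names (`top_k`, `irreversible_prune`, `bypass_prune`), the per-stage importance scores, and the keep budgets are all hypothetical, and the abstract as shown here does not specify how SwiftVLM scores tokens or carries bypassed tokens between stages. The sketch only assumes that each pruning stage assigns every visual token an importance score and keeps the top-k.

```python
from typing import Dict, List, Sequence


def top_k(scores: Dict[int, float], candidates: Sequence[int], k: int) -> List[int]:
    """Return the k highest-scoring candidate token ids, sorted by id."""
    ranked = sorted(candidates, key=lambda t: scores[t], reverse=True)
    return sorted(ranked[:k])


def irreversible_prune(stage_scores: List[Dict[int, float]], keep: List[int]) -> List[List[int]]:
    """Conventional early pruning: a token dropped at one stage never returns."""
    candidates = list(stage_scores[0].keys())
    history = []
    for scores, k in zip(stage_scores, keep):
        candidates = top_k(scores, candidates, k)  # survivors only
        history.append(candidates)
    return history


def bypass_prune(stage_scores: List[Dict[int, float]], keep: List[int]) -> List[List[int]]:
    """Bypass-style pruning (toy version): unselected tokens are set aside
    and re-enter the candidate pool at the next pruning stage."""
    selected = list(stage_scores[0].keys())
    bypassed: List[int] = []
    history = []
    for scores, k in zip(stage_scores, keep):
        pool = selected + bypassed                 # re-evaluate everything
        selected = top_k(scores, pool, k)
        bypassed = [t for t in pool if t not in selected]
        history.append(selected)
    return history


# Hypothetical scores for six visual tokens at a shallow and a deeper stage.
# Token 2 looks unimportant early but becomes the most relevant one later.
stage_scores = [
    {0: 0.9, 1: 0.8, 2: 0.1, 3: 0.7, 4: 0.2, 5: 0.6},
    {0: 0.3, 1: 0.2, 2: 0.95, 3: 0.8, 4: 0.1, 5: 0.4},
]
keep = [4, 2]

print(irreversible_prune(stage_scores, keep))  # token 2 is lost at stage 0
print(bypass_prune(stage_scores, keep))        # token 2 is recovered at stage 1
```

In this toy setting both strategies keep the same number of tokens at each stage, so the compute budget for selected tokens is identical. The only difference is which tokens remain eligible for selection at the deeper stage.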

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications