Nüwa: Mending the Spatial Integrity Torn by VLM Token Pruning

Yihong Huang; Fei Ma; Yihua Shao; Jingcai Guo; Zitong Yu; Laizhong Cui; Qi Tian

arXiv:2602.02951 · cs.CV · February 4, 2026

TL;DR

Nüwa is a two-stage token pruning framework that preserves spatial integrity in vision language models, significantly improving performance on visual grounding and question answering tasks.

Contribution

It introduces a new token pruning method that maintains spatial information, addressing limitations of existing approaches in VLMs.

Findings

01

Achieves state-of-the-art results on VQA benchmarks (94%-95%).

02

Substantially improves visual grounding performance (7% to 47%).

03

Maintains spatial integrity while pruning tokens effectively.

Abstract

Vision token pruning has proven to be an effective acceleration technique for efficient Vision Language Models (VLMs). However, existing pruning methods preserve performance well on visual question answering (VQA) but suffer substantial degradation on visual grounding (VG) tasks. Our analysis of the VLM's processing pipeline reveals that strategies utilizing global semantic similarity and attention scores lose the global spatial reference frame, which is derived from the interactions of tokens' positional information. Motivated by these findings, we propose Nüwa, a two-stage token pruning framework that enables efficient feature aggregation while maintaining spatial integrity. In the first stage, after the vision encoder, we apply three operations, namely separation, alignment, and aggregation, which are inspired by swarm intelligence algorithms to…
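
The abstract's point about score-based pruning can be made concrete with a toy example. The sketch below is not the paper's method; it only illustrates, under simplifying assumptions, how keeping the top-k tokens by a per-token score can leave most of the image grid without any surviving token. The 8x8 patch grid, the synthetic scores, and the helper names `topk_prune` and `grid_coverage` are all hypothetical choices made for illustration.

```python
# Toy illustration (assumption: an 8x8 grid of vision tokens, flattened row-major).
# This is NOT the paper's algorithm; it only shows how score-based top-k pruning
# can concentrate the kept tokens in one region of the image.

def topk_prune(scores, keep):
    """Return the indices of the `keep` highest-scoring tokens, in original order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])

def grid_coverage(indices, grid_w, cell):
    """Count how many cell x cell regions of the patch grid contain a kept token."""
    regions = set()
    for i in indices:
        row, col = divmod(i, grid_w)
        regions.add((row // cell, col // cell))
    return len(regions)

GRID = 8   # hypothetical patch grid width and height
KEEP = 16  # hypothetical token budget after pruning

# Synthetic scores: a salient object in the top-left quadrant scores high,
# everything else scores low.
scores = [
    1.0 if (i // GRID) < 4 and (i % GRID) < 4 else 0.1
    for i in range(GRID * GRID)
]

score_based = topk_prune(scores, KEEP)         # keeps only the salient quadrant
uniform = list(range(0, GRID * GRID, 4))       # same budget, spread over the grid

print(grid_coverage(score_based, GRID, 4))     # regions covered by top-k pruning
print(grid_coverage(uniform, GRID, 4))         # regions covered by uniform sampling
```

With the same budget of 16 tokens, the score-based selection covers one of the four quadrants while the uniform selection covers all four. Uniform sampling is used here only as a contrast; it is not what the paper proposes.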

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques