Token-Space Mask Prediction for Efficient Vision Transformer Segmentation

Calvin Galagain, Martyna Poreba, and François Goulette

arXiv:2605.18177 · cs.CV · May 19, 2026

TL;DR

TokenMask is a token-space mask head for Vision Transformer segmentation that predicts masks without explicit image-space reconstruction, which improves efficiency and simplifies deployment across ViT backbones and segmentation tasks.

Contribution

The paper proposes a token-space mask head that computes mask logits directly from query-token affinities and interpolates in logit space rather than feature space, which simplifies the architecture and improves efficiency.

Findings

01

TokenMask reduces computational and memory requirements compared with prior approaches.

02

It maintains competitive accuracy across diverse ViT backbones, datasets, and segmentation tasks.

03

It achieves tangible speedups on NVIDIA Jetson AGX Orin using TensorRT FP16.

Abstract

Query-based Vision Transformer segmentation models typically reconstruct dense spatial feature maps to predict masks, inheriting design patterns from convolutional architectures. We show that this explicit image-space reconstruction is not required. We introduce TokenMask, a token-space mask head that computes mask logits directly from query-token affinities and performs interpolation in logit space rather than feature space. This reformulation preserves the original linear scoring mechanism while simplifying the computational structure. Across diverse ViT backbones, datasets and segmentation tasks, TokenMask consistently improves efficiency over prior approaches by reducing computational and memory requirements while maintaining competitive accuracy, leading to tangible speedups on NVIDIA Jetson AGX Orin using TensorRT FP16 inference. Overall, TokenMask yields a simpler and more…
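
The abstract's central claim, that interpolating in logit space preserves the linear scoring mechanism, follows from a general identity: a dot product and a linear interpolation commute. The sketch below illustrates that identity only and is not the authors' implementation. The 1-D token grid, the endpoint-aligned linear interpolation, and all function names (`upsample_linear`, `feature_space_logits`, `token_space_logits`) are assumptions made for illustration.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def upsample_linear(rows, out_len):
    """Linearly interpolate a sequence of vectors to out_len positions.

    Assumes len(rows) >= 2 and out_len >= 2; endpoints are aligned.
    """
    n = len(rows)
    out = []
    for i in range(out_len):
        pos = i * (n - 1) / (out_len - 1)   # fractional source index
        lo = min(int(pos), n - 2)           # left neighbour
        w = pos - lo                        # weight of right neighbour
        out.append([(1 - w) * a + w * b for a, b in zip(rows[lo], rows[lo + 1])])
    return out


def feature_space_logits(tokens, query, out_len):
    """Upsample D-dimensional features first, then score each position."""
    dense = upsample_linear(tokens, out_len)
    return [dot(feature, query) for feature in dense]


def token_space_logits(tokens, query, out_len):
    """Score each token first, then upsample the scalar logits."""
    logits = [[dot(token, query)] for token in tokens]
    return [row[0] for row in upsample_linear(logits, out_len)]


# Hypothetical toy data: 4 tokens with 3 channels, one query vector.
tokens = [[1.0, 0.0, 2.0], [0.5, -1.0, 0.0], [2.0, 1.0, -1.0], [0.0, 3.0, 1.0]]
query = [0.5, -2.0, 1.0]

a = feature_space_logits(tokens, query, 13)
b = token_space_logits(tokens, query, 13)
max_gap = max(abs(x - y) for x, y in zip(a, b))
print(f"max difference between the two orders: {max_gap:.2e}")
```

Both orders give the same logits up to floating-point rounding, but the token-space order interpolates one scalar per position instead of a D-dimensional feature vector. The equivalence depends on both the scoring and the interpolation being linear; it would not hold if a nonlinearity sat between them.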
