Token-Space Mask Prediction for Efficient Vision Transformer Segmentation
Calvin Galagain, Martyna Poreba, and François Goulette

TL;DR
TokenMask is a token-space mask head for Vision Transformer segmentation. It removes the explicit image-space reconstruction step, which reduces computational and memory requirements and simplifies deployment across ViT backbones, datasets, and segmentation tasks.
Contribution
The paper proposes a token-space mask head that computes mask logits directly from query-token affinities and interpolates in logit space rather than feature space, which simplifies the architecture and improves efficiency.
Findings
TokenMask reduces computational and memory requirements relative to prior approaches.
It maintains competitive accuracy across diverse ViT backbones, datasets, and segmentation tasks.
It achieves tangible speedups on NVIDIA Jetson AGX Orin using TensorRT FP16.
Abstract
Query-based Vision Transformer segmentation models typically reconstruct dense spatial feature maps to predict masks, inheriting design patterns from convolutional architectures. We show that this explicit image-space reconstruction is not required. We introduce TokenMask, a token-space mask head that computes mask logits directly from query-token affinities and performs interpolation in logit space rather than feature space. This reformulation preserves the original linear scoring mechanism while simplifying the computational structure. Across diverse ViT backbones, datasets and segmentation tasks, TokenMask consistently improves efficiency over prior approaches by reducing computational and memory requirements while maintaining competitive accuracy, leading to tangible speedups on NVIDIA Jetson AGX Orin using TensorRT FP16 inference. Overall, TokenMask yields a simpler and more…
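The abstract's central claim, that interpolating logits can stand in for interpolating features without changing the linear scoring, can be illustrated with a toy example. The sketch below is not the authors' implementation: it assumes a 1-D token grid, plain linear interpolation, and dot-product affinities, and the helper names `lerp_upsample` and `affinities` are hypothetical. Because both interpolation and the dot product are linear, the two orders of operations should give the same logits up to floating-point error.

```python
# Toy check (an illustration under the assumptions above, not the paper's code):
# scoring upsampled features vs. upsampling token-level scores.

def lerp_upsample(rows, factor):
    """Linearly upsample a list of vectors along the token axis.

    Needs at least two rows; yields (len(rows) - 1) * factor + 1 rows.
    """
    n = len(rows)
    m = (n - 1) * factor + 1
    out = []
    for i in range(m):
        pos = i / factor            # fractional position on the token grid
        lo = min(int(pos), n - 2)   # left neighbour, clamped at the last cell
        w = pos - lo                # weight of the right neighbour
        out.append([(1 - w) * a + w * b for a, b in zip(rows[lo], rows[lo + 1])])
    return out


def affinities(tokens, queries):
    """Dot product of every token with every query: [num_tokens][num_queries]."""
    return [[sum(t * q for t, q in zip(tok, qry)) for qry in queries]
            for tok in tokens]


# Hypothetical data: 4 tokens and 2 queries, each with 3 feature channels.
tokens = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0], [3.0, 1.0, 0.0], [-1.0, 2.0, 1.0]]
queries = [[0.5, -1.0, 2.0], [1.0, 1.0, 0.0]]

# Feature-space order: upsample the features first, then score each position.
feature_space = affinities(lerp_upsample(tokens, 4), queries)

# Token-space order: score at token resolution, then upsample the logits.
token_space = lerp_upsample(affinities(tokens, queries), 4)

# Largest absolute difference between the two logit maps.
max_gap = max(abs(a - b)
              for row_a, row_b in zip(feature_space, token_space)
              for a, b in zip(row_a, row_b))
```

In this sketch the token-space order interpolates one channel per query instead of one channel per feature dimension, and the dot products run over 4 token positions rather than 13 upsampled ones. Whether that translates into a net saving depends on the number of queries, the feature width, and the upsampling factor of the actual model.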
