ARTA: Adaptive Mixed-Resolution Token Allocation for Efficient Dense Feature Extraction
David Hagerman, Roman Naeem, Erik Brorsson, Fredrik Kahl, Lennart Svensson

TL;DR
ARTA introduces a mixed-resolution transformer that adaptively allocates tokens based on semantic boundaries, significantly reducing computation while maintaining high accuracy in dense feature extraction.
Contribution
ARTA proposes a coarse-to-fine token allocation method driven by a lightweight allocator, improving efficiency and accuracy over existing dense vision transformers.
Findings
Achieves state-of-the-art results on ADE20K and COCO-Stuff with fewer FLOPs.
Maintains competitive performance on Cityscapes with lower computational cost.
Attains 54.6 mIoU on ADE20K with fewer parameters and less memory.
Abstract
We present ARTA, a mixed-resolution coarse-to-fine vision transformer for efficient dense feature extraction. Unlike models that begin with dense high-resolution (fine) tokens, ARTA starts with low-resolution (coarse) tokens and uses a lightweight allocator to predict which regions require more fine tokens. The allocator iteratively predicts a semantic (class) boundary score and allocates additional tokens to patches above a low threshold, concentrating token density near boundaries while maintaining high sensitivity to weak boundary evidence. This targeted allocation encourages tokens to represent a single semantic class rather than a mixture of classes. Mixed-resolution attention enables interaction between coarse and fine tokens, focusing computation on semantically complex areas while avoiding redundant processing in homogeneous regions. Experiments demonstrate that ARTA achieves…
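The coarse-to-fine allocation described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `allocate_tokens`, the patch sizes, the threshold value, and the use of a precomputed `boundary_map` (with the per-patch maximum standing in for the learned allocator's boundary score) are all assumptions made for illustration.

```python
import numpy as np

def allocate_tokens(boundary_map, coarse=32, fine=8, threshold=0.1):
    """Return (row, col, size) patches that tile the image at mixed resolution.

    Hypothetical sketch: `boundary_map` is a 2D array of boundary scores in
    [0, 1]. In ARTA the score comes from a learned allocator; here the maximum
    value inside a patch is used as a stand-in.
    Assumes `coarse` is `fine` times a power of two and divides both image sides.
    """
    h, w = boundary_map.shape
    assert h % coarse == 0 and w % coarse == 0
    # Start from a uniform grid of coarse patches (one token each).
    patches = [(r, c, coarse) for r in range(0, h, coarse) for c in range(0, w, coarse)]
    size = coarse
    while size > fine:
        half = size // 2
        refined = []
        for r, c, s in patches:
            # Stand-in boundary score: strongest evidence inside the patch.
            score = boundary_map[r:r + s, c:c + s].max()
            if s == size and score > threshold:
                # Split into four children, adding tokens near the boundary.
                refined.extend((r + dr, c + dc, half) for dr in (0, half) for dc in (0, half))
            else:
                # Homogeneous region: keep the single coarser token.
                refined.append((r, c, s))
        patches = refined
        size = half
    return patches

# Usage: a 64x64 image with one vertical boundary at column 40.
demo_map = np.zeros((64, 64))
demo_map[:, 40] = 1.0
demo_patches = allocate_tokens(demo_map)
print(len(demo_patches))  # 22 tokens, versus 64 for a dense 8x8 grid
```

In this toy case the fine tokens cluster around the boundary column while the boundary-free left half stays at two coarse tokens. A low threshold, as in the abstract, means even weak boundary evidence triggers refinement.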
