TL;DR
CanViT is a scalable, task- and policy-agnostic active vision foundation model. It combines a new cross-attention mechanism (Canvas Attention) with label-free pretraining, and it outperforms previous active models on ADE20K segmentation while also performing well on ImageNet-1k classification.
Contribution
The paper presents CanViT, which the authors describe as the first task- and policy-agnostic Active-Vision Foundation Model (AVFM), together with Canvas Attention and a label-free active vision pretraining scheme. It reports better results than previous active models.
Findings
CanViT outperforms previous active models on ADE20K segmentation with fewer FLOPs.
Pretraining on 13.2 million scenes enables effective active vision modeling.
CanViT achieves 84.5% top-1 accuracy on ImageNet-1k after fine-tuning.
Abstract
Active computer vision promises efficient, biologically plausible perception through sequential, localized glimpses, but lacks scalable general-purpose architectures and pretraining pipelines, leaving Active-Vision Foundation Models (AVFMs) underexplored. We introduce CanViT, the first task- and policy-agnostic AVFM. CanViT uses scene-relative RoPE to bind a retinotopic Vision Transformer backbone and a spatiotopic scene-wide latent workspace, the canvas. Efficient interaction with this high-capacity working memory is supported by Canvas Attention, a novel asymmetric cross-attention mechanism. We decouple thinking (backbone-level) and memory (canvas-level), eliminating canvas-side self-attention and fully-connected layers to achieve fast sequential inference and scalability to high output resolutions. We propose a label-free active vision pretraining scheme, policy-agnostic…
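The read/write interaction between a glimpse and the canvas can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes single-head attention, random weights, and a simple 2D RoPE over scene coordinates measured in canvas-cell units. It also takes "asymmetric" to mean that the read and write directions use separate projections and that the canvas receives only a residual cross-attention update, with no self-attention or MLP on the canvas side. All function and parameter names (`rope_2d`, `grid_positions`, `cross_attend`, `canvas_step`, `params`) are hypothetical.

```python
import numpy as np

def rope_2d(x, pos, base=100.0):
    """Rotate channel pairs of x (N, d) by angles derived from 2D positions (N, 2).
    The first half of the pairs encodes y, the second half encodes x; d % 4 == 0."""
    n, d = x.shape
    quarter = d // 4
    freqs = base ** (-np.arange(quarter) / quarter)            # (quarter,)
    ang = np.concatenate([pos[:, :1] * freqs, pos[:, 1:] * freqs], axis=1)  # (N, d/2)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, : d // 2], x[:, d // 2:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=1)

def grid_positions(h, w, y0, x0, y1, x1):
    """Token centers of an h x w grid covering the box [y0, y1] x [x0, x1],
    expressed in scene coordinates shared by glimpse and canvas."""
    ys = y0 + (np.arange(h) + 0.5) * (y1 - y0) / h
    xs = x0 + (np.arange(w) + 0.5) * (x1 - x0) / w
    yy, xx = np.meshgrid(ys, xs, indexing="ij")
    return np.stack([yy.ravel(), xx.ravel()], axis=1)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_attend(q_tok, q_pos, kv_tok, kv_pos, wq, wk, wv):
    """Single-head cross-attention with scene-relative RoPE on queries and keys."""
    q = rope_2d(q_tok @ wq, q_pos)
    k = rope_2d(kv_tok @ wk, kv_pos)
    v = kv_tok @ wv
    return softmax(q @ k.T / np.sqrt(q.shape[1])) @ v

def canvas_step(canvas, canvas_pos, glimpse, glimpse_pos, params):
    # Read: glimpse (backbone) tokens query the canvas.
    glimpse = glimpse + cross_attend(glimpse, glimpse_pos, canvas, canvas_pos, *params["read"])
    # A real backbone would run its ViT blocks on `glimpse` here (omitted).
    # Write: canvas tokens query the glimpse; residual update only,
    # no canvas-side self-attention or MLP (assumption of this sketch).
    canvas = canvas + cross_attend(canvas, canvas_pos, glimpse, glimpse_pos, *params["write"])
    return canvas, glimpse

rng = np.random.default_rng(0)
d = 16
canvas_pos = grid_positions(8, 8, 0.0, 0.0, 8.0, 8.0)    # canvas spans the whole scene
glimpse_pos = grid_positions(4, 4, 2.0, 1.0, 6.0, 5.0)   # glimpse covers a sub-window
canvas = np.zeros((64, d))
glimpse = rng.normal(size=(16, d))
params = {name: [rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3)]
          for name in ("read", "write")}

canvas, glimpse = canvas_step(canvas, canvas_pos, glimpse, glimpse_pos, params)
print(canvas.shape, glimpse.shape)
```

Because both token sets are positioned in the same scene coordinates, the attention logits depend only on the offset between a glimpse token and a canvas cell, so the same weights apply wherever the glimpse lands. The per-step cost of the canvas update in this sketch grows with the product of canvas and glimpse token counts rather than with the square of the canvas size.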
Code & Models
- 🤗 canvit/canvitb16-add-vpe-pretrain-g128px-s512px-in21k-dv3b16-2026-02-02 · model · 1.4k downloads · ♡ 3
- 🤗 canvit/canvitb16-add-vpe-pretrain-g128px-s512px-in21k-dv3b16-2026-02-02-mlx · model · 36 downloads
- 🤗 canvit/probe-ade20k-40k-s512-c32-in21k · model · 291 downloads
- 🤗 canvit/probe-ade20k-40k-s1024-c64-in21k · model · 49 downloads
- 🤗 canvit/probe-ade20k-40k-dv3b-128px · model · 7 downloads
- 🤗 canvit/probe-ade20k-40k-dv3b-144px · model · 4 downloads
- 🤗 canvit/probe-ade20k-40k-dv3b-160px · model · 6 downloads
- 🤗 canvit/probe-ade20k-40k-dv3b-192px · model · 5 downloads
- 🤗 canvit/probe-ade20k-40k-dv3b-256px · model · 5 downloads
- 🤗 canvit/probe-ade20k-40k-dv3b-384px · model · 5 downloads
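One way to fetch these checkpoints locally is the `huggingface_hub` package, sketched below. The repository ids are copied from the list above. The helper names (`dv3b_probe_repo`, `fetch`) are hypothetical, and the sketch only downloads files. How the weights are loaded into a model depends on the authors' code, which is not shown here.

```python
# Repository ids copied from the model list on this page.
PRETRAINED_REPO = "canvit/canvitb16-add-vpe-pretrain-g128px-s512px-in21k-dv3b16-2026-02-02"
DV3B_PROBE_SIZES = (128, 144, 160, 192, 256, 384)

def dv3b_probe_repo(px: int) -> str:
    """Return the repo id of the ADE20K 'dv3b' probe listed for a given pixel size."""
    if px not in DV3B_PROBE_SIZES:
        raise ValueError(f"no dv3b probe listed for {px}px; choose from {DV3B_PROBE_SIZES}")
    return f"canvit/probe-ade20k-40k-dv3b-{px}px"

def fetch(repo_id: str) -> str:
    """Download every file of a Hub repo and return the local folder path.
    Requires network access and `pip install huggingface_hub`."""
    from huggingface_hub import snapshot_download  # imported lazily on purpose
    return snapshot_download(repo_id=repo_id)

if __name__ == "__main__":
    print(fetch(dv3b_probe_repo(128)))
```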
Taxonomy
Topics: Advanced Memory and Neural Computing · Domain Adaptation and Few-Shot Learning · Neural Networks and Reservoir Computing
