CanViT: Toward Active-Vision Foundation Models

Yohaï-Eliel Berreby; Sabrina Du; Audrey Durand; B. Suresh Krishna

arXiv:2603.22570·cs.CV·May 19, 2026

1 Repo 28 Models

TL;DR

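CanViT is a scalable, task- and policy-agnostic active-vision foundation model. It pairs a retinotopic Vision Transformer backbone with a scene-wide latent canvas through a new cross-attention mechanism, is pretrained without labels, and outperforms previous active models on scene understanding while remaining competitive on classification after fine-tuning.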

Contribution

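The paper presents CanViT, which the authors describe as the first task- and policy-agnostic Active-Vision Foundation Model (AVFM), together with Canvas Attention, an asymmetric cross-attention mechanism, and a label-free active-vision pretraining scheme. It reports better ADE20K segmentation than previous active models at fewer FLOPs.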

Findings

01. CanViT outperforms previous active models on ADE20K segmentation with fewer FLOPs.

02. Pretraining on 13.2 million scenes enables effective active vision modeling.

03. CanViT achieves 84.5% top-1 accuracy on ImageNet-1k after fine-tuning.

Abstract

Active computer vision promises efficient, biologically plausible perception through sequential, localized glimpses, but lacks scalable general-purpose architectures and pretraining pipelines, leaving Active-Vision Foundation Models (AVFMs) underexplored. We introduce CanViT, the first task- and policy-agnostic AVFM. CanViT uses scene-relative RoPE to bind a retinotopic Vision Transformer backbone and a spatiotopic scene-wide latent workspace, the canvas. Efficient interaction with this high-capacity working memory is supported by Canvas Attention, a novel asymmetric cross-attention mechanism. We decouple thinking (backbone-level) and memory (canvas-level), eliminating canvas-side self-attention and fully-connected layers to achieve fast sequential inference and scalability to high output resolutions. We propose a label-free active vision pretraining scheme, policy-agnostic…
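The read/write pattern between the backbone and the canvas described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation (see the m2b3/CanViT-PyTorch repository for that); it is one possible reading under simplifying assumptions: a single head, no learned projections, and a plain residual update on the canvas. The function names `glimpse_positions`, `rope_2d`, and `cross_attend`, the RoPE base, and all sizes are hypothetical. The abstract does not say where the asymmetry of Canvas Attention lies, so the sketch only shows that backbone tokens and canvas tokens share scene coordinates and that the canvas has no self-attention or fully-connected layers of its own.

```python
import numpy as np

def glimpse_positions(center, size, grid):
    """Scene-relative (y, x) centers of a grid x grid patch layout.

    center: (cy, cx) of the glimpse in scene coordinates (assumed in [0, 1]).
    size:   side length of the glimpse as a fraction of the scene.
    """
    offsets = (np.arange(grid) + 0.5) / grid - 0.5  # patch centers in [-0.5, 0.5]
    ys = center[0] + size * offsets
    xs = center[1] + size * offsets
    yy, xx = np.meshgrid(ys, xs, indexing="ij")
    return np.stack([yy.ravel(), xx.ravel()], axis=-1)  # (grid * grid, 2)

def rope_2d(x, pos, base=100.0):
    """Rotate feature pairs by angles proportional to scene position.

    x: (n, d) with d divisible by 4; first half encodes y, second half x.
    The base value is an arbitrary choice for this sketch.
    """
    n, d = x.shape
    half = d // 2
    freqs = base ** (-np.arange(half // 2) / (half // 2))  # (d / 4,)
    out = np.empty_like(x)
    for axis in range(2):
        ang = pos[:, axis:axis + 1] * freqs  # (n, d / 4)
        cos, sin = np.cos(ang), np.sin(ang)
        seg = x[:, axis * half:(axis + 1) * half]
        a, b = seg[:, 0::2], seg[:, 1::2]
        out[:, axis * half:(axis + 1) * half:2] = a * cos - b * sin
        out[:, axis * half + 1:(axis + 1) * half:2] = a * sin + b * cos
    return out

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_attend(q_tokens, q_pos, kv_tokens, kv_pos):
    """Single-head cross-attention with RoPE on queries and keys (no projections)."""
    q = rope_2d(q_tokens, q_pos)
    k = rope_2d(kv_tokens, kv_pos)
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return attn @ kv_tokens

rng = np.random.default_rng(0)
d, canvas_grid, glimpse_grid = 16, 8, 4

# The canvas covers the whole scene; the glimpse covers a small window of it.
canvas = rng.normal(size=(canvas_grid ** 2, d))
canvas_pos = glimpse_positions((0.5, 0.5), 1.0, canvas_grid)
backbone = rng.normal(size=(glimpse_grid ** 2, d))
backbone_pos = glimpse_positions((0.3, 0.7), 0.25, glimpse_grid)

# Read: backbone tokens query the canvas.
backbone = backbone + cross_attend(backbone, backbone_pos, canvas, canvas_pos)
# Write: canvas tokens query the backbone. The canvas receives only this
# residual update, with no self-attention or MLP on the canvas side.
canvas = canvas + cross_attend(canvas, canvas_pos, backbone, backbone_pos)
```

Because both token sets are positioned in scene coordinates, the attention scores depend only on the offset between a glimpse patch and a canvas cell, wherever the glimpse lands. The canvas update costs one cross-attention per glimpse, which is the kind of cheap memory-side computation the abstract attributes to decoupling thinking from memory.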

Code & Models

Repositories

m2b3/CanViT-PyTorch (GitHub)

Models

Taxonomy

Topics: Advanced Memory and Neural Computing · Domain Adaptation and Few-Shot Learning · Neural Networks and Reservoir Computing