CoCAViT: Compact Vision Transformer with Robust Global Coordination

Xuyang Wang; Lingjuan Miao; Zhiqiang Zhou

arXiv:2508.05307·cs.CV·August 8, 2025

TL;DR

CoCAViT introduces a robust, efficient vision transformer architecture that enhances out-of-distribution generalization and local-global feature modeling, achieving high accuracy and low latency on multiple benchmarks.

Contribution

The paper proposes CoCAViT, a novel vision transformer with a Coordinator-patch Cross Attention mechanism that improves robustness and performance of small models across diverse tasks.

Findings

01

Achieves 84.0% top-1 accuracy on ImageNet-1K with 28M parameters.

02

Significantly improves out-of-distribution benchmark performance.

03

Attains high accuracy on COCO detection and ADE20K segmentation with low latency.

Abstract

In recent years, large-scale visual backbones have demonstrated remarkable capabilities in learning general-purpose features from images via extensive pre-training. Concurrently, many efficient architectures have emerged whose performance is comparable to that of larger models on in-domain benchmarks. However, we observe that for smaller models, the performance drop on out-of-distribution (OOD) data is disproportionately larger, indicating a deficiency in the generalization performance of existing efficient models. To address this, we identify the key architectural bottlenecks and inappropriate design choices that contribute to this issue, so that smaller models retain their robustness. To restore the global field of pure window attention, we further introduce a Coordinator-patch Cross Attention (CoCA) mechanism, featuring dynamic, domain-aware global tokens that enhance local-global feature…
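The abstract describes CoCA only at a high level, so the following is a rough, generic sketch of how a handful of global tokens can exchange information with patch tokens through cross attention. It is not the authors' implementation. The function names (`cross_attention`, `coordinate`), the two-step gather-then-broadcast order, the residual addition, and the omission of learned projections, multiple heads, and windowing are all assumptions made for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys_values):
    """Single-head scaled dot-product cross attention.

    Assumption: no learned Q/K/V projections; the tokens are used directly.
    """
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys_values]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, keys_values)) for d in range(dim)]
        )
    return outputs


def coordinate(patches, coordinators):
    """Hypothetical global coordination step (illustrative only).

    1. Gather: each coordinator token attends to all patch tokens.
    2. Broadcast: each patch token attends to the updated coordinators,
       and the result is added back to the patch (residual connection).
    """
    updated = cross_attention(coordinators, patches)
    context = cross_attention(patches, updated)
    mixed = [[p + c for p, c in zip(patch, ctx)] for patch, ctx in zip(patches, context)]
    return mixed, updated


if __name__ == "__main__":
    # Four toy patch tokens and two toy coordinator tokens, each of dimension 2.
    toy_patches = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 0.5]]
    toy_coordinators = [[0.1, 0.2], [0.3, -0.1]]
    mixed_patches, new_coordinators = coordinate(toy_patches, toy_coordinators)
    print(len(mixed_patches), len(new_coordinators))
```

Because the number of coordinator tokens is small and fixed, both attention steps in this sketch scale linearly with the number of patches, which is one common motivation for pairing global tokens with local window attention.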
