CoCAViT: Compact Vision Transformer with Robust Global Coordination
Xuyang Wang, Lingjuan Miao, Zhiqiang Zhou

TL;DR
CoCAViT is a compact vision transformer designed to improve out-of-distribution generalization and local-global feature modeling in small models, reaching 84.0% top-1 accuracy on ImageNet-1K with 28M parameters at low latency.
Contribution
The paper proposes CoCAViT, a novel vision transformer with a Coordinator-patch Cross Attention mechanism that improves robustness and performance of small models across diverse tasks.
Findings
Achieves 84.0% top-1 accuracy on ImageNet-1K with 28M parameters.
Significantly improves out-of-distribution benchmark performance.
Attains high accuracy on COCO detection and ADE20K segmentation with low latency.
Abstract
In recent years, large-scale visual backbones have demonstrated remarkable capabilities in learning general-purpose features from images via extensive pre-training. Concurrently, many efficient architectures have emerged that perform comparably to larger models on in-domain benchmarks. However, we observe that for smaller models, the performance drop on out-of-distribution (OOD) data is disproportionately larger, indicating a deficiency in the generalization of existing efficient models. To address this, we identify the key architectural bottlenecks and inappropriate design choices that contribute to this issue, so that smaller models retain robustness. To restore the global receptive field lost under pure window attention, we further introduce a Coordinator-patch Cross Attention (CoCA) mechanism, featuring dynamic, domain-aware global tokens that enhance local-global feature…
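The abstract is truncated before it describes CoCA in detail, so the following is only a rough sketch of the general idea of a small set of global tokens exchanging information with patch tokens through cross attention. It is not the authors' implementation. The function names, the two-step gather-then-broadcast order, the residual updates, and the omission of learned projections, multiple heads, and window attention are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values):
    # Single-head scaled dot-product cross attention.
    # Assumption: no learned Q/K/V projections, to keep the sketch short.
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)
    weights = softmax(scores, axis=-1)
    return weights @ keys_values, weights

def coordinator_patch_exchange(patches, coordinators):
    # Hypothetical two-step exchange (not taken from the paper):
    # 1) global tokens gather a summary from all patch tokens;
    # 2) each patch token reads back from the updated global tokens.
    gathered, _ = cross_attention(coordinators, patches)
    coordinators = coordinators + gathered
    broadcast, weights = cross_attention(patches, coordinators)
    patches = patches + broadcast
    return patches, coordinators, weights

rng = np.random.default_rng(0)
patches = rng.standard_normal((49, 32))      # e.g. a 7x7 grid of patch tokens
coordinators = rng.standard_normal((4, 32))  # a few global tokens
new_patches, new_coords, weights = coordinator_patch_exchange(patches, coordinators)
print(new_patches.shape, new_coords.shape, weights.shape)
```

With N patch tokens and G global tokens, each step in this sketch costs on the order of N·G attention scores rather than the N² of full self-attention, which is the usual motivation for routing global context through a few tokens.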
