TL;DR
FTerViT is a fully ternarized Vision Transformer: every weight matrix and normalization parameter is ternarized, not just the encoder layers. It reaches high accuracy at a small model size and is deployed on a microcontroller.
Contribution
It presents the first fully ternarized Vision Transformer, introduces two new operators (TernaryBitConv2d for patch embedding and TernaryLayerNorm), and demonstrates deployment on a microcontroller, improving efficiency and on-device feasibility.
Findings
Achieves 82.43% ImageNet accuracy at 6.09MB with ternary weights.
Outperforms prior ternary ViT methods by up to 8 percentage points.
Successfully deploys on a dual-core microcontroller, demonstrating practical on-device use.
Abstract
Ternary Vision Transformers offer substantial model compression; however, state-of-the-art methods ternarize only the encoder layers, leaving patch embeddings, LayerNorm parameters, and classifier heads in full precision. In compact models targeting resource-constrained processors, such as microcontrollers, these remaining full-precision components determine the total memory footprint, severely limiting deployment efficiency and on-device feasibility. In this work, we introduce a fully ternarized Vision Transformer (FTerViT) in which all weight matrices and normalization parameters are ternarized. To this end, we introduce two novel operators: TernaryBitConv2d, with per-channel scaling for patch embedding, and TernaryLayerNorm. FTerViT is trained using knowledge distillation, followed by a lightweight quantization-aware recovery phase. Our ternary W2A8 DeiT-III-S at 384×384…
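To make "ternarized weights with per-channel scaling" concrete, here is a minimal sketch of generic per-channel ternary quantization. It is not the paper's TernaryBitConv2d: the threshold rule (0.7 times the mean absolute weight, in the style of Ternary Weight Networks), the function names `ternarize_per_channel` and `dequantize`, and the example weights are all assumptions chosen for illustration.

```python
from typing import List, Tuple


def ternarize_per_channel(
    weights: List[List[float]],
) -> Tuple[List[List[int]], List[float]]:
    """Map each row (one output channel) to codes in {-1, 0, +1} plus one scale.

    Assumption: a TWN-style threshold of 0.7 * mean(|w|) per channel.
    The abstract does not state which rule FTerViT uses.
    """
    codes: List[List[int]] = []
    scales: List[float] = []
    for row in weights:
        mean_abs = sum(abs(w) for w in row) / len(row)
        threshold = 0.7 * mean_abs
        # Small weights become 0; the rest keep only their sign.
        q = [0 if abs(w) <= threshold else (1 if w > 0 else -1) for w in row]
        # The channel scale is the mean magnitude of the weights that survived.
        kept = [abs(w) for w, c in zip(row, q) if c != 0]
        scale = sum(kept) / len(kept) if kept else 0.0
        codes.append(q)
        scales.append(scale)
    return codes, scales


def dequantize(codes: List[List[int]], scales: List[float]) -> List[List[float]]:
    """Reconstruct approximate weights as scale * code for each channel."""
    return [[s * c for c in row] for row, s in zip(codes, scales)]


# Two hypothetical output channels with very different weight magnitudes.
weights = [
    [0.9, -0.05, -1.1, 0.02],
    [0.10, 0.12, -0.01, -0.11],
]
codes, scales = ternarize_per_channel(weights)
approx = dequantize(codes, scales)
```

The second channel is roughly ten times smaller than the first. A single tensor-wide threshold would zero it out entirely, whereas a per-channel scale keeps its sign pattern, which is one common motivation for per-channel scaling.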
