EfficientViT-SAM: Accelerated Segment Anything Model Without Accuracy Loss
Zhuoyang Zhang, Han Cai, Song Han

TL;DR
EfficientViT-SAM accelerates the Segment Anything Model (SAM) by replacing its heavy image encoder with EfficientViT, delivering a 48.9x measured TensorRT speedup on an A100 GPU over SAM-ViT-H without sacrificing segmentation performance.
Contribution
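This work develops EfficientViT-SAM, which keeps SAM's lightweight prompt encoder and mask decoder and swaps the image encoder for EfficientViT. Training proceeds in two stages: knowledge distillation from the SAM-ViT-H image encoder to EfficientViT, followed by end-to-end training on the SA-1B dataset.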
Findings
48.9x measured TensorRT speedup on A100 GPU over SAM-ViT-H
Maintains the segmentation performance of SAM-ViT-H
Open-source code and models provided
Abstract
We present EfficientViT-SAM, a new family of accelerated segment anything models. We retain SAM's lightweight prompt encoder and mask decoder while replacing the heavy image encoder with EfficientViT. For the training, we begin with the knowledge distillation from the SAM-ViT-H image encoder to EfficientViT. Subsequently, we conduct end-to-end training on the SA-1B dataset. Benefiting from EfficientViT's efficiency and capacity, EfficientViT-SAM delivers 48.9x measured TensorRT speedup on A100 GPU over SAM-ViT-H without sacrificing performance. Our code and pre-trained models are released at https://github.com/mit-han-lab/efficientvit.
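The distillation stage described in the abstract can be illustrated with a toy sketch. This is not the authors' training code: the mean-squared-error loss, the one-scale-per-feature "student encoder", the learning rate, and all data values are assumptions chosen only to show the idea of fitting a student's embeddings to the outputs of a frozen teacher.

```python
from typing import List, Sequence


def mse_loss(student: Sequence[float], teacher: Sequence[float]) -> float:
    """Mean squared error between two embeddings of equal length (assumed loss)."""
    if len(student) != len(teacher):
        raise ValueError("embeddings must have the same length")
    return sum((s - t) ** 2 for s, t in zip(student, teacher)) / len(student)


def student_forward(weights: Sequence[float], image: Sequence[float]) -> List[float]:
    """Hypothetical toy student encoder: one learnable scale per input feature."""
    return [w * x for w, x in zip(weights, image)]


def distill_step(
    weights: Sequence[float],
    image: Sequence[float],
    teacher_embedding: Sequence[float],
    lr: float = 0.1,
) -> List[float]:
    """One gradient-descent step on the MSE loss; the teacher output stays fixed."""
    n = len(weights)
    pred = student_forward(weights, image)
    # d(loss)/d(w_i) = 2 * (w_i * x_i - t_i) * x_i / n
    grads = [2.0 * (p - t) * x / n for p, t, x in zip(pred, teacher_embedding, image)]
    return [w - lr * g for w, g in zip(weights, grads)]


# Hypothetical data: a 3-feature "image" and the frozen teacher's embedding of it.
image = [1.0, 2.0, 3.0]
teacher_embedding = [2.0, 2.0, 2.0]
weights = [0.0, 0.0, 0.0]

loss_before = mse_loss(student_forward(weights, image), teacher_embedding)
for _ in range(200):
    weights = distill_step(weights, image, teacher_embedding)
loss_after = mse_loss(student_forward(weights, image), teacher_embedding)
```

In this sketch only the student's weights change; the teacher embedding is treated as a fixed target, which mirrors the role of SAM-ViT-H in the first training stage. The subsequent end-to-end stage on SA-1B is not modeled here.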
Taxonomy
Topics: Software System Performance and Reliability · IoT and Edge/Fog Computing · Distributed systems and fault tolerance
Methods: Knowledge Distillation
