EdgeTAM: On-Device Track Anything Model
Chong Zhou, Chenchen Zhu, Yunyang Xiong, Saksham Suri, Fanyi Xiao,, Lemeng Wu, Raghuraman Krishnamoorthi, Bo Dai, Chen Change Loy, Vikas Chandra,, Bilge Soran

TL;DR
EdgeTAM significantly enhances the efficiency of the SAM 2 video segmentation model, enabling real-time on-device processing on mobile devices while maintaining high accuracy through a novel 2D Spatial Perceiver and distillation techniques.
Contribution
The paper introduces EdgeTAM, a novel approach that reduces computational cost of SAM 2 using a 2D Spatial Perceiver and a distillation pipeline, enabling mobile device deployment.
Findings
Achieves high segmentation accuracy on multiple benchmarks.
Runs at 16 FPS on iPhone 15 Pro Max.
Maintains performance comparable to state-of-the-art models.
Abstract
On top of Segment Anything Model (SAM), SAM 2 further extends its capability from image to video inputs through a memory bank mechanism and obtains a remarkable performance compared with previous methods, making it a foundation model for video segmentation task. In this paper, we aim at making SAM 2 much more efficient so that it even runs on mobile devices while maintaining a comparable performance. Despite several works optimizing SAM for better efficiency, we find they are not sufficient for SAM 2 because they all focus on compressing the image encoder, while our benchmark shows that the newly introduced memory attention blocks are also the latency bottleneck. Given this observation, we propose EdgeTAM, which leverages a novel 2D Spatial Perceiver to reduce the computational cost. In particular, the proposed 2D Spatial Perceiver encodes the densely stored frame-level memories with a…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsVisual Attention and Saliency Detection · Advanced Neural Network Applications · Big Data and Digital Economy
MethodsAttention Is All You Need · Absolute Position Encodings · Adam · Residual Connection · Dropout · Softmax · Byte Pair Encoding · Linear Layer · Multi-Head Attention · Position-Wise Feed-Forward Layer
