EdgeTAM: On-Device Track Anything Model

Chong Zhou; Chenchen Zhu; Yunyang Xiong; Saksham Suri; Fanyi Xiao; Lemeng Wu; Raghuraman Krishnamoorthi; Bo Dai; Chen Change Loy; Vikas Chandra; Bilge Soran

arXiv:2501.07256 · cs.CV · January 14, 2025

TL;DR

EdgeTAM makes the SAM 2 video segmentation model efficient enough to run on mobile devices (16 FPS on iPhone 15 Pro Max) while keeping accuracy comparable, using a novel 2D Spatial Perceiver and a distillation pipeline.

Contribution

The paper introduces EdgeTAM, which reduces the computational cost of SAM 2 with a 2D Spatial Perceiver and a distillation pipeline, making deployment on mobile devices practical.

Findings

01 Achieves high segmentation accuracy on multiple benchmarks.

02 Runs at 16 FPS on iPhone 15 Pro Max.

03 Maintains performance comparable to state-of-the-art models.

Abstract

Building on the Segment Anything Model (SAM), SAM 2 extends its capability from image to video inputs through a memory bank mechanism and achieves remarkable performance compared with previous methods, making it a foundation model for the video segmentation task. In this paper, we aim to make SAM 2 much more efficient so that it can run even on mobile devices while maintaining comparable performance. Although several works optimize SAM for better efficiency, we find they are not sufficient for SAM 2 because they all focus on compressing the image encoder, while our benchmark shows that the newly introduced memory attention blocks are also a latency bottleneck. Given this observation, we propose EdgeTAM, which leverages a novel 2D Spatial Perceiver to reduce the computational cost. In particular, the proposed 2D Spatial Perceiver encodes the densely stored frame-level memories with a…
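
The abstract is cut off before it describes the 2D Spatial Perceiver in detail, so the following is only a minimal sketch of the generic Perceiver idea the name suggests: a small set of latent vectors cross-attends to a long memory sequence, and later attention reads the short summary instead of the dense memory. It is not the authors' implementation. All sizes, the single-head attention, and the helper names (`cross_attend`, `random_tokens`) are assumptions made for illustration, and the latents are random here where a real model would learn them.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, keys, values):
    """Single-head scaled dot-product cross-attention on lists of vectors."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    dim = len(values[0])
    out = []
    for q in queries:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, values))
                    for d in range(dim)])
    return out


def random_tokens(n, dim, rng):
    """Hypothetical stand-in for real features: n random vectors of size dim."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n)]


rng = random.Random(0)
DIM = 8            # assumed channel width
NUM_MEMORY = 256   # assumed number of dense memory tokens
NUM_LATENTS = 16   # assumed number of latent vectors
NUM_QUERIES = 64   # assumed number of current-frame tokens

memory = random_tokens(NUM_MEMORY, DIM, rng)
latents = random_tokens(NUM_LATENTS, DIM, rng)  # learned in practice
frame = random_tokens(NUM_QUERIES, DIM, rng)

# Step 1: the latents query the dense memory and produce a short summary.
compressed = cross_attend(latents, memory, memory)

# Step 2: current-frame tokens attend to the summary, not the full memory.
attended = cross_attend(frame, compressed, compressed)

# Query-key score counts for the frame-to-memory attention step.
dense_scores = NUM_QUERIES * NUM_MEMORY
compressed_scores = NUM_QUERIES * NUM_LATENTS
print(len(compressed), len(attended), dense_scores, compressed_scores)
# prints: 16 64 16384 1024
```

With these assumed sizes, the frame-to-memory step computes 1,024 attention scores instead of 16,384. The compression step has its own cost (latents times memory tokens), which this sketch does not count.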

Code & Models

Repositories

facebookresearch/edgetam (PyTorch, official)

Taxonomy

Topics: Visual Attention and Saliency Detection · Advanced Neural Network Applications · Big Data and Digital Economy

Methods: Attention Is All You Need · Absolute Position Encodings · Adam · Residual Connection · Dropout · Softmax · Byte Pair Encoding · Linear Layer · Multi-Head Attention · Position-Wise Feed-Forward Layer