Token Turing Machines are Efficient Vision Models

Purvish Jajal, Nick John Eliopoulos, Benjamin Shiue-Hal Chou, George K. Thiruvathukal, James C. Davis, and Yung-Hsiang Lu

arXiv:2409.07613 · cs.CV · January 27, 2025

TL;DR

The paper introduces Vision Token Turing Machines (ViTTM), a memory-augmented transformer that significantly reduces inference time and computational cost for vision tasks while maintaining or improving accuracy.

Contribution

It presents a novel architecture combining process and memory tokens for efficient, low-latency vision modeling, extending Turing machine concepts to vision transformers.

Findings

1. ViTTM-B is 56% faster than ViT-B on ImageNet-1K.

2. ViTTM-B achieves higher accuracy than ViT-B (82.9% vs. 81.0%) with fewer FLOPs.

3. On ADE20K, ViTTM-B achieves twice the FPS of ViT-B while maintaining a similar mIoU.

Abstract

We propose Vision Token Turing Machines (ViTTM), an efficient, low-latency, memory-augmented Vision Transformer (ViT). Our approach builds on Neural Turing Machines and Token Turing Machines, which were applied to NLP and sequential visual understanding tasks. ViTTMs are designed for non-sequential computer vision tasks such as image classification and segmentation. Our model creates two sets of tokens: process tokens and memory tokens. Process tokens pass through the encoder blocks and, at each block, read from and write to the memory tokens, allowing them to store and retrieve information from memory. By ensuring that there are fewer process tokens than memory tokens, we are able to reduce the inference time of the network while maintaining its accuracy. On ImageNet-1K, the state-of-the-art ViT-B has a median latency of 529.5 ms and 81.0% accuracy, while our ViTTM-B is 56% faster…
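
The read, process, and write cycle described in the abstract can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes that reads and writes are unprojected cross-attention with a residual connection, that each block runs in the order read, process, write, and that the encoder block can be stood in for by plain self-attention. The names `cross_attend`, `process`, and `vittm_block`, and the token counts (4 process tokens, 16 memory tokens, dimension 8), are hypothetical and chosen only for illustration.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, sources):
    """Each query becomes a softmax-weighted average of the source tokens.

    Assumption: no learned projections; real models use Q/K/V weight matrices.
    """
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    out = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, s)) for s in sources]
        weights = softmax(scores)
        out.append([sum(w * s[d] for w, s in zip(weights, sources))
                    for d in range(dim)])
    return out


def add(tokens, update):
    """Element-wise residual connection between two token sets of equal shape."""
    return [[a + b for a, b in zip(t, u)] for t, u in zip(tokens, update)]


def process(tokens):
    """Stand-in for an encoder block: self-attention plus a residual."""
    return add(tokens, cross_attend(tokens, tokens))


def vittm_block(process_tokens, memory_tokens):
    """One hypothetical block: read from memory, process, write to memory."""
    # Read: the process tokens retrieve information from the memory tokens.
    process_tokens = add(process_tokens,
                         cross_attend(process_tokens, memory_tokens))
    # Process: only the smaller process-token set goes through the encoder.
    process_tokens = process(process_tokens)
    # Write: the memory tokens store information from the process tokens.
    memory_tokens = add(memory_tokens,
                        cross_attend(memory_tokens, process_tokens))
    return process_tokens, memory_tokens


random.seed(0)
dim = 8
memory = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(16)]
proc = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(4)]

for _ in range(3):  # three stacked blocks
    proc, memory = vittm_block(proc, memory)

print(len(proc), len(memory))  # token counts are unchanged: 4 16
```

In this sketch the encoder stand-in only ever sees the 4 process tokens, while the 16 memory tokens are touched only by the read and write steps. That is the property the abstract relies on when it keeps the process-token set smaller than the memory-token set.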

Code & Models

Repositories

pjjajal/efficientttms (PyTorch, official)

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques

Methods: Byte Pair Encoding · Absolute Position Encodings · Vision Transformer · Softmax · Label Smoothing · Dropout · Layer Normalization · Attention Is All You Need · Position-Wise Feed-Forward Layer · Linear Layer