ElasticTok: Adaptive Tokenization for Image and Video

Wilson Yan; Volodymyr Mnih; Aleksandra Faust; Matei Zaharia; Pieter Abbeel; Hao Liu

arXiv:2410.08368·cs.LG·February 4, 2025

Open Access · 3 Reviews

TL;DR

ElasticTok introduces an adaptive tokenization method for images and videos that dynamically allocates tokens based on data complexity, improving efficiency and encoding quality for long sequences.

Contribution

It proposes a novel adaptive tokenization technique conditioned on prior frames, enabling variable-length encoding for more efficient processing of visual data.

Findings

1. Reduces token usage while maintaining reconstruction quality
2. Dynamically allocates more tokens to complex data and fewer to simple data
3. Enhances scalability for long video sequences

Abstract

Efficient video tokenization remains a key bottleneck in learning general purpose vision models that are capable of processing long video sequences. Prevailing approaches are restricted to encoding videos to a fixed number of tokens, where too few tokens will result in overly lossy encodings, and too many tokens will result in prohibitively long sequence lengths. In this work, we introduce ElasticTok, a method that conditions on prior frames to adaptively encode a frame into a variable number of tokens. To enable this in a computationally scalable way, we propose a masking technique that drops a random number of tokens at the end of each frame's token encoding. During inference, ElasticTok can dynamically allocate tokens when needed: more complex data can leverage more tokens, while simpler data only needs a few tokens. Our empirical evaluations on images and video demonstrate the…
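The two mechanisms the abstract describes (dropping a random number of tokens at the end of each frame's encoding during training, and picking a per-frame token count at inference) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `sample_tail_masks` and `smallest_token_budget`, the uniform sampling of the kept length, the `min_keep` parameter, and the error-threshold stopping rule are all assumptions made for illustration. The `recon_error` callable is a hypothetical stand-in for decoding with the first `k` tokens and measuring reconstruction error.

```python
import random


def sample_tail_masks(num_frames, tokens_per_frame, min_keep=1, rng=None):
    """Sample one keep-mask per frame: 1 = token kept, 0 = token dropped.

    Only tokens at the END of each frame's encoding are dropped, so every
    mask is a run of ones followed by a run of zeros. Drawing the kept
    length uniformly is an assumption, not a detail taken from the paper.
    """
    rng = rng or random.Random()
    masks, kept = [], []
    for _ in range(num_frames):
        k = rng.randint(min_keep, tokens_per_frame)  # inclusive on both ends
        masks.append([1] * k + [0] * (tokens_per_frame - k))
        kept.append(k)
    return masks, kept


def smallest_token_budget(recon_error, max_tokens, threshold):
    """Linear scan for the fewest tokens whose error meets the threshold.

    recon_error(k) is a hypothetical callable returning the reconstruction
    error when only the first k tokens of a frame are kept.
    """
    for k in range(1, max_tokens + 1):
        if recon_error(k) <= threshold:
            return k
    return max_tokens  # no budget met the threshold; fall back to all tokens


if __name__ == "__main__":
    masks, kept = sample_tail_masks(num_frames=3, tokens_per_frame=8,
                                    rng=random.Random(0))
    for mask, k in zip(masks, kept):
        print(k, mask)

    # Toy error curve: error shrinks as more tokens are kept.
    print(smallest_token_budget(lambda k: 1.0 / k, max_tokens=8, threshold=0.25))
```

The linear scan evaluates one reconstruction per candidate budget, which is why a per-frame search of this kind becomes expensive on long videos; a cheaper search or a learned predictor of the budget would replace only `smallest_token_budget`.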

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01 · Rating 6 · Confidence 5

Strengths

- The paper studies an interesting topic of adaptive tokenization for images and videos. As mentioned, such adaptive tokenization is crucial for long-context tasks.
- While similar adaptive and compression architectures have been studied in the past, the application to auto-encoding of images and videos is a novel contribution. Related works in adaptive representations and compression are well covered.
- Results include interesting analysis of token requirement for reconstruction and VLM tasks.

Weaknesses

- The paper writing needs improvement:
  - **A background section on Ring attention** (Liu et al., 2024b) should be the first paragraph of the method section, since Ring attention is a big component of the proposed approach. Most important are the **details of the autoregressive video model**.
  - Algorithms 1 and 2 say that the variable "x" (output of PatchifyRearrange) has shape $B \times L \times D$; what exactly is $L$? From the next line ($N_b$ = the number of blocks), it seems that "x" sh

Reviewer 02 · Rating 6 · Confidence 3

Strengths

1. This paper proposes an effective way to compress the number of tokens adaptively.
2. The length of used tokens aligns well with human intuition: when the image is more complex, more tokens are used.
3. The proposed method can effectively represent long videos with up to 2-5x fewer tokens.

Weaknesses

1. Lack of comparison. The proposed method should be compared with other state-of-the-art token compression methods in terms of compression rate, reconstruction quality, and downstream task performance.
2. Loss of details in reconstruction. From the demos, the reconstruction quality is not very satisfactory. For example, the loading icon on the cellphone has totally disappeared or become distorted.
3. More analysis of the drop in image quality when using the elastic token compression is desired. Consi

Reviewer 03 · Rating 6 · Confidence 3

Strengths

Exploiting temporal coherence to learn a more compact video representation is a well-motivated problem. Figure 4 does indeed show that the dropout strategy can reduce the number of tokens as intended, although it requires a computationally exhaustive search to find the right number of tokens for each frame.

Weaknesses

The "neural regression" evaluation (Table 4) does not seem very helpful. If I understand correctly, the error rate is the rate the regressor failed to correctly predict the optimal number of tokens found by the exhaustive search. It would have been good to also show the reconstruction accuracy. Also it would have been good to show the total inference time or FLOPS for each search method. It can be architecture-dependent but how much computation can be saved using fewer tokens in the decoder shou

Taxonomy

Topics: Video Analysis and Summarization · Computer Graphics and Visualization Techniques · Video Coding and Compression Technologies