Homogeneous Tokenizer Matters: Homogeneous Visual Tokenizer for Remote Sensing Image Understanding

Run Shao; Zhaoyang Zhang; Chao Tao; Yunsheng Zhang; Chengli Peng; Haifeng Li

arXiv:2403.18593 · cs.CV · October 15, 2024 · 1 citation

Open Access · 1 Repo

TL;DR

This paper introduces HOOK, a homogeneous visual tokenizer for remote sensing images that produces meaningful object-based tokens, improving accuracy and efficiency over patch-based methods in classification and segmentation tasks.

Contribution

The paper proposes a novel homogeneous visual tokenizer, HOOK, which perceives semantically independent regions and generates meaningful object tokens, advancing visual understanding in remote sensing.

Findings

01  HOOK outperforms Patch Embed by 6-10% in accuracy.

02  HOOK uses fewer tokens, improving efficiency by 1.5-2.8 times.

03  HOOK achieves state-of-the-art performance on multiple remote sensing datasets.

Abstract

The tokenizer, as one of the fundamental components of large models, has long been overlooked or even misunderstood in visual tasks. One key factor of the great comprehension power of the large language model is that natural language tokenizers utilize meaningful words or subwords as the basic elements of language. In contrast, mainstream visual tokenizers, represented by patch-based methods such as Patch Embed, rely on meaningless rectangular patches as basic elements of vision, which cannot serve as effectively as words or subwords in language. Starting from the essence of the tokenizer, we defined semantically independent regions (SIRs) for vision. We designed a simple HOmogeneous visual tOKenizer: HOOK. HOOK mainly consists of two modules: the Object Perception Module (OPM) and the Object Vectorization Module (OVM). To achieve homogeneity, the OPM splits the image into 4*4 pixel…
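
To make the abstract's criticism of patch-based tokenizers concrete, here is a minimal sketch of the splitting step only. It is not HOOK and not the authors' code. The function name `patchify` and the toy image are hypothetical, and the learned linear projection that a real Patch Embed layer applies to each patch is omitted. The sketch shows that a fixed grid cuts the image without regard to content, so a single token can mix two distinct regions.

```python
def patchify(image, patch_size):
    """Split a 2D image (list of rows) into non-overlapping square patches.

    Returns flattened patches in row-major order. Only the splitting step of
    a patch-based tokenizer is mirrored here (illustrative assumption); the
    projection to an embedding vector is left out.
    """
    height = len(image)
    width = len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be divisible by patch_size")
    tokens = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [
                image[r][c]
                for r in range(top, top + patch_size)
                for c in range(left, left + patch_size)
            ]
            tokens.append(patch)
    return tokens


# Toy 8x8 "image": the two leftmost columns are region 0, the rest region 1.
image = [[0] * 2 + [1] * 6 for _ in range(8)]
tokens = patchify(image, patch_size=4)

print(len(tokens))      # number of tokens is fixed by the grid: 4
print(set(tokens[0]))   # top-left patch mixes both regions
print(set(tokens[1]))   # top-right patch happens to contain one region
```

The token count depends only on image size and patch size, and the region boundary falls inside the left-hand patches. This is the sense in which the abstract calls rectangular patches "meaningless" as basic elements.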

Code & Models

Repositories

geox-lab/hook (PyTorch, official)

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques