Homogeneous Tokenizer Matters: Homogeneous Visual Tokenizer for Remote Sensing Image Understanding
Run Shao, Zhaoyang Zhang, Chao Tao, Yunsheng Zhang, Chengli Peng, Haifeng Li

TL;DR
This paper introduces HOOK, a homogeneous visual tokenizer for remote sensing images that produces meaningful object-based tokens, improving accuracy and efficiency over patch-based methods in classification and segmentation tasks.
Contribution
The paper proposes a novel homogeneous visual tokenizer, HOOK, which perceives semantically independent regions and generates meaningful object tokens, advancing visual understanding in remote sensing.
Findings
HOOK outperforms Patch Embed by 6-10% in accuracy.
HOOK uses fewer tokens, improving efficiency by 1.5-2.8 times.
HOOK achieves state-of-the-art performance on multiple remote sensing datasets.
Abstract
The tokenizer, as one of the fundamental components of large models, has long been overlooked or even misunderstood in visual tasks. One key factor of the great comprehension power of the large language model is that natural language tokenizers utilize meaningful words or subwords as the basic elements of language. In contrast, mainstream visual tokenizers, represented by patch-based methods such as Patch Embed, rely on meaningless rectangular patches as basic elements of vision, which cannot serve as effectively as words or subwords in language. Starting from the essence of the tokenizer, we defined semantically independent regions (SIRs) for vision. We designed a simple HOmogeneous visual tOKenizer: HOOK. HOOK mainly consists of two modules: the Object Perception Module (OPM) and the Object Vectorization Module (OVM). To achieve homogeneity, the OPM splits the image into 4*4 pixel…
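To make the first step mentioned in the abstract concrete, here is a minimal sketch of splitting an image into non-overlapping 4*4 pixel blocks, next to a single coarse 16*16 patch for comparison. It is not the authors' OPM: it contains no attention or learned components, and the helper name `split_into_blocks`, the toy 16*16 image, and the 16*16 comparison patch size are assumptions made purely for illustration.

```python
def split_into_blocks(image, block=4):
    """Split an H x W grid (a list of rows) into non-overlapping block x block tiles.

    Hypothetical helper for illustration only; not taken from the HOOK code.
    Tiles are returned in row-major order, each tile being a list of rows.
    This sketch assumes H and W are divisible by `block`.
    """
    height, width = len(image), len(image[0])
    if height % block or width % block:
        raise ValueError("image size must be divisible by the block size")
    tiles = []
    for top in range(0, height, block):
        for left in range(0, width, block):
            tiles.append([row[left:left + block] for row in image[top:top + block]])
    return tiles


# Toy 16x16 single-channel image; each pixel value encodes its position.
image = [[r * 16 + c for c in range(16)] for r in range(16)]

blocks = split_into_blocks(image, block=4)     # fine-grained 4x4 units
patches = split_into_blocks(image, block=16)   # one coarse rectangular patch

print(len(blocks), len(patches))  # prints "16 1"
```

The fine-grained units only fix where the image is cut. How they are then grouped into semantically independent regions and turned into tokens is the part the abstract attributes to the OPM and OVM, and this sketch does not attempt it.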
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques
