Subobject-level Image Tokenization

Delong Chen; Samuel Cahyawijaya; Jianfeng Liu; Baoyuan Wang; Pascale Fung

arXiv:2402.14327 · cs.CV · March 14, 2025 · 2 citations

Open Access · 1 Repo · 1 Model

TL;DR

This paper introduces EPOC, a subobject-level image tokenizer that combines boundary detection with watershed segmentation. Its tokens align with human annotations of object- and part-level structure, giving more efficient image representations for vision-language models.

Contribution

The paper proposes EPOC, a novel image tokenizer that segments images at the subobject level, outperforming patch-based tokenization in efficiency and in alignment with human-annotated visual morphology.

Findings

01. EPOC produces more monosemantic tokens aligned with human annotations.

02. EPOC enables faster convergence in vision-language models.

03. Subobject tokenization improves generalization with fewer tokens.

Abstract

Patch-based image tokenization ignores the morphology of the visual world, limiting effective and efficient learning of image understanding. Inspired by subword tokenization, we introduce subobject-level adaptive token segmentation and explore several approaches, including superpixel, SAM, and a proposed Efficient and PanOptiC (EPOC) image tokenizer. Our EPOC combines boundary detection -- a simple task that can be handled well by a compact model -- with watershed segmentation, which inherently guarantees no pixels are left unsegmented. Intrinsic evaluations across 5 datasets demonstrate that EPOC's segmentation aligns well with human annotations of both object- and part-level visual morphology, producing more monosemantic tokens and offering substantial efficiency advantages. For extrinsic evaluation, we designed a token embedding that handles arbitrary-shaped tokens, and trained VLMs…
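The pairing of boundary detection with watershed segmentation described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the boundary map here is a hand-written stand-in for the output of a boundary detector, and the function names (`label_seeds`, `watershed_tokens`), the 0.5 threshold, and the 4-connectivity are all assumptions made for illustration. The sketch labels low-boundary regions as seeds, then floods outward in order of increasing boundary probability, so every pixel ends up in some token.

```python
import heapq

NEIGHBORS = ((-1, 0), (1, 0), (0, -1), (0, 1))  # 4-connectivity (assumption)


def label_seeds(boundary, threshold):
    """Label 4-connected regions whose boundary probability is below threshold as 1..K."""
    h, w = len(boundary), len(boundary[0])
    labels = [[0] * w for _ in range(h)]
    count = 0
    for y in range(h):
        for x in range(w):
            if labels[y][x] or boundary[y][x] >= threshold:
                continue
            count += 1
            labels[y][x] = count
            stack = [(y, x)]
            while stack:  # depth-first fill of one seed region
                cy, cx = stack.pop()
                for dy, dx in NEIGHBORS:
                    ny, nx = cy + dy, cx + dx
                    if (0 <= ny < h and 0 <= nx < w and not labels[ny][nx]
                            and boundary[ny][nx] < threshold):
                        labels[ny][nx] = count
                        stack.append((ny, nx))
    return labels, count


def watershed_tokens(boundary, threshold=0.5):
    """Grow seed regions over the boundary map until every pixel has a token id."""
    labels, count = label_seeds(boundary, threshold)
    if count == 0:
        raise ValueError("no seed region below threshold")
    h, w = len(boundary), len(boundary[0])
    heap = []

    def push_unlabeled_neighbors(y, x, label):
        for dy, dx in NEIGHBORS:
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w and not labels[ny][nx]:
                heapq.heappush(heap, (boundary[ny][nx], ny, nx, label))

    for y in range(h):
        for x in range(w):
            if labels[y][x]:
                push_unlabeled_neighbors(y, x, labels[y][x])

    # Priority flood: pixels with the lowest boundary probability are claimed first.
    while heap:
        _, y, x, label = heapq.heappop(heap)
        if labels[y][x]:
            continue
        labels[y][x] = label
        push_unlabeled_neighbors(y, x, label)
    return labels, count


# Toy boundary map (hypothetical detector output): a vertical ridge in the middle column.
boundary = [[0.1, 0.1, 0.9, 0.1, 0.1] for _ in range(5)]
token_map, num_tokens = watershed_tokens(boundary)
```

Because the flood continues until the queue is empty, the ridge pixels themselves are absorbed into a neighboring region, which is the property the abstract points to when it says watershed leaves no pixels unsegmented. In practice a library routine such as scikit-image's watershed would likely replace this hand-rolled flood.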

Code & Models

Repositories

chendelong1999/subobjects
PyTorch · Official

Models

🤗 chendelong/DirectSAM-1800px-0424
model · 2.4k downloads · 3 likes

Taxonomy

Topics: Cell Image Analysis Techniques · Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques