Subobject-level Image Tokenization
Delong Chen, Samuel Cahyawijaya, Jianfeng Liu, Baoyuan Wang, Pascale Fung

TL;DR
This paper introduces EPOC, a subobject-level image tokenizer that improves visual segmentation and understanding by combining boundary detection with watershed segmentation, leading to more efficient and accurate image representations.
Contribution
The paper proposes EPOC, an image tokenizer that segments images into subobject-level tokens and outperforms patch-based tokenization in efficiency and in alignment with human-annotated visual morphology.
Findings
EPOC produces more monosemantic tokens aligned with human annotations.
EPOC enables faster convergence in vision-language models.
Subobject tokenization improves generalization with fewer tokens.
Abstract
Patch-based image tokenization ignores the morphology of the visual world, limiting effective and efficient learning of image understanding. Inspired by subword tokenization, we introduce subobject-level adaptive token segmentation and explore several approaches, including superpixel, SAM, and a proposed Efficient and PanOptiC (EPOC) image tokenizer. Our EPOC combines boundary detection -- a simple task that can be handled well by a compact model -- with watershed segmentation, which inherently guarantees no pixels are left unsegmented. Intrinsic evaluations across 5 datasets demonstrate that EPOC's segmentation aligns well with human annotations of both object- and part-level visual morphology, producing more monosemantic tokens and offering substantial efficiency advantages. For extrinsic evaluation, we designed a token embedding that handles arbitrary-shaped tokens, and trained VLMs…
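The abstract's claim that watershed segmentation "inherently guarantees no pixels are left unsegmented" can be illustrated with a small sketch. This is not the authors' implementation: in EPOC the boundary map comes from a learned boundary detector, whereas here it is a hand-written 2D list of boundary probabilities. The function name `watershed_from_boundaries`, the `seed_threshold` parameter, and the seeding rule (connected low-boundary regions) are assumptions made for illustration, and the flooding step is a plain priority-queue watershed written with the standard library only.

```python
import heapq


def watershed_from_boundaries(boundary, seed_threshold=0.2):
    """Assign a positive segment label to every pixel of a 2D boundary map.

    Hypothetical helper: `boundary` is a list of rows of boundary
    probabilities in [0, 1]; `seed_threshold` is an illustrative parameter.
    """
    h, w = len(boundary), len(boundary[0])
    labels = [[0] * w for _ in range(h)]  # 0 means "not yet assigned"

    def neighbors(r, c):
        # 4-connected neighbours that lie inside the image.
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            rr, cc = r + dr, c + dc
            if 0 <= rr < h and 0 <= cc < w:
                yield rr, cc

    # Step 1: seeds are connected regions of low boundary probability.
    next_label = 0
    for r in range(h):
        for c in range(w):
            if labels[r][c] == 0 and boundary[r][c] < seed_threshold:
                next_label += 1
                labels[r][c] = next_label
                stack = [(r, c)]
                while stack:
                    y, x = stack.pop()
                    for yy, xx in neighbors(y, x):
                        if labels[yy][xx] == 0 and boundary[yy][xx] < seed_threshold:
                            labels[yy][xx] = next_label
                            stack.append((yy, xx))

    # Fallback: if nothing is below the threshold, seed at the global minimum.
    if next_label == 0:
        r, c = min(((r, c) for r in range(h) for c in range(w)),
                   key=lambda p: boundary[p[0]][p[1]])
        labels[r][c] = 1

    # Step 2: flood outward from the seeds, lowest boundary probability first.
    heap, order = [], 0  # `order` breaks ties so the heap never compares further
    for r in range(h):
        for c in range(w):
            if labels[r][c]:
                for rr, cc in neighbors(r, c):
                    if labels[rr][cc] == 0:
                        heapq.heappush(heap, (boundary[rr][cc], order, rr, cc, labels[r][c]))
                        order += 1
    while heap:
        _, _, r, c, label = heapq.heappop(heap)
        if labels[r][c]:
            continue  # already claimed by an earlier, lower-cost flood
        labels[r][c] = label
        for rr, cc in neighbors(r, c):
            if labels[rr][cc] == 0:
                heapq.heappush(heap, (boundary[rr][cc], order, rr, cc, label))
                order += 1
    return labels


# Toy input: two flat regions separated by a vertical ridge of high boundary probability.
ridge = [[0.0, 0.0, 0.0, 0.9, 0.0, 0.0, 0.0] for _ in range(5)]
labels = watershed_from_boundaries(ridge)
num_tokens = len({v for row in labels for v in row})
```

Because flooding only stops when the queue is empty and the pixel grid is connected, every pixel ends up with a label, including the ridge pixels themselves. The number of distinct labels plays the role of the token count, which adapts to image content instead of being fixed by a patch grid.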
Taxonomy
Topics: Cell Image Analysis Techniques · Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques
