Incorporating Feature Pyramid Tokenization and Open Vocabulary Semantic Segmentation
Jianyu Zhang, Li Zhang, Shijian Li

TL;DR
This paper introduces Feature Pyramid Tokenization (PAT), which improves open-vocabulary semantic segmentation by clustering features at multiple resolutions and learning pixel-level and semantic representations jointly. The design follows the way a pretrained VLM progressively composes low-level features into high-level semantics.
Contribution
It proposes a novel unified tokenization approach that improves semantic understanding and segmentation performance while maintaining parameter efficiency and flexibility.
Findings
Enhanced semantic intuition of VLM feature pyramids
Improved segmentation performance over baseline models
Achieved competitive results on open vocabulary benchmarks
Abstract
Visual understanding is often approached at three levels of granularity: image, patch, and pixel. Visual tokenization, trained by self-supervised reconstructive learning, compresses visual data with a patch-level codebook at marginal information loss, but the resulting visual tokens do not carry semantic meaning. Open-vocabulary semantic segmentation benefits from evolving vision-language models (VLMs) with strong image-level zero-shot capability, but transferring image-level understanding to the pixel level remains a pressing challenge. In this paper, we treat segmentation as tokenizing pixels and study a unified perceptual and semantic token compression across all levels of granularity, which in turn facilitates open-vocabulary semantic segmentation. Referring to the cognitive process of a pretrained VLM, where low-level features are progressively composed into high-level semantics, we propose Feature…
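To make the codebook idea in the abstract concrete, here is a minimal sketch of generic nearest-codeword tokenization: each patch feature is replaced by the index of its closest codebook entry. It is not the paper's PAT method, whose details are not given in this excerpt. The function names, the toy codebook, and the feature values are all assumptions chosen for illustration.

```python
from typing import List, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def tokenize(features: Sequence[Sequence[float]],
             codebook: Sequence[Sequence[float]]) -> List[int]:
    """Map each feature vector to the index of its nearest codeword."""
    return [
        min(range(len(codebook)),
            key=lambda k: squared_distance(feature, codebook[k]))
        for feature in features
    ]


def reconstruct(tokens: Sequence[int],
                codebook: Sequence[Sequence[float]]) -> List[List[float]]:
    """Look each token back up in the codebook (lossy reconstruction)."""
    return [list(codebook[t]) for t in tokens]


# Hypothetical 2-D codebook with three entries (illustrative values only).
codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]

# Hypothetical patch features, one vector per patch.
patch_features = [[0.1, -0.1], [0.9, 1.2], [0.2, 0.8], [0.05, 0.1]]

tokens = tokenize(patch_features, codebook)
approximation = reconstruct(tokens, codebook)
print(tokens)
```

In this toy setting the token indices carry no semantic meaning on their own, which is the limitation the abstract attributes to reconstruction-trained tokenizers.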
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
