Incorporating Feature Pyramid Tokenization and Open Vocabulary Semantic Segmentation

Jianyu Zhang; Li Zhang; Shijian Li

arXiv:2412.14145·cs.CV·December 19, 2024

TL;DR

This paper introduces Feature Pyramid Tokenization (PAT), a method that enhances open vocabulary semantic segmentation by combining multi-resolution feature clustering with joint pixel and semantic learning, inspired by cognitive processes.

Contribution

It proposes a novel unified tokenization approach that improves semantic understanding and segmentation performance while maintaining parameter efficiency and flexibility.

Findings

01. Enhanced semantic intuition of VLM feature pyramids

02. Improved segmentation performance over baseline models

03. Achieved competitive results on open vocabulary benchmarks

Abstract

Visual understanding is often approached at three levels of granularity: image, patch, and pixel. Visual tokenization, trained by self-supervised reconstructive learning, compresses visual data with a patch-level codebook at marginal information loss, but the resulting visual tokens carry no semantic meaning. Open-vocabulary semantic segmentation benefits from evolving vision-language models (VLMs) with strong zero-shot image-level capability, but transferring image-level understanding to the pixel level remains a pressing challenge. In this paper, we treat segmentation as tokenizing pixels and study a unified perceptual and semantic token compression for understanding at all levels of granularity, which in turn facilitates open-vocabulary semantic segmentation. Drawing on the cognitive process of pretrained VLMs, in which low-level features are progressively composed into high-level semantics, we propose Feature…
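The codebook compression mentioned in the abstract can be illustrated with a generic nearest-neighbor vector quantization step. The following is a minimal sketch under stated assumptions, not the paper's method: the names `tokenize` and `reconstruct`, the 2-D feature vectors, and the three-entry codebook are all hypothetical toy choices, and a real tokenizer would learn its codebook rather than fix it by hand.

```python
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def tokenize(patches: List[Vector], codebook: List[Vector]) -> List[int]:
    """Map each patch feature to the index of its nearest codebook entry."""
    return [
        min(range(len(codebook)),
            key=lambda k: squared_distance(patch, codebook[k]))
        for patch in patches
    ]


def reconstruct(tokens: List[int], codebook: List[Vector]) -> List[List[float]]:
    """Replace each token index by its codebook vector (lossy decoding)."""
    return [list(codebook[t]) for t in tokens]


# Hypothetical toy data: a hand-picked codebook and four patch features.
codebook = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
patches = [(0.1, -0.1), (0.9, 1.2), (0.2, 0.8), (0.95, 0.9)]

tokens = tokenize(patches, codebook)
approx = reconstruct(tokens, codebook)
```

Each patch is stored as a single integer index, which is where the compression comes from. The indices themselves say nothing about object categories, which matches the abstract's remark that such tokens lack semantic meaning.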

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling