VLTP: Vision-Language Guided Token Pruning for Task-Oriented Segmentation

Hanning Chen; Yang Ni; Wenjun Huang; Yezi Liu; SungHeon Jeong; Fei Wen; Nathaniel Bastian; Hugo Latapie; Mohsen Imani

arXiv:2409.08464 · cs.CV · December 2, 2024

TL;DR

VLTP introduces a vision-language guided token pruning method that significantly reduces computational costs in task-oriented segmentation models based on Vision Transformers, without substantial performance loss.

Contribution

The paper proposes a novel token pruning mechanism, guided by vision-language models, designed specifically for complex task-oriented segmentation (TOS).

Findings

01. Reduces ViT computational cost by approximately 25% without performance loss.

02. Reduces ViT computational cost by around 40% with only a 1% performance drop.

03. Effective for segmentation guided by a multi-modal large language model (MLLM).

Abstract

Vision Transformers (ViTs) have emerged as the backbone of many segmentation models, consistently achieving state-of-the-art (SOTA) performance. However, their success comes at a significant computational cost. Image token pruning is one of the most effective strategies to address this complexity, but previous approaches fall short when applied to more complex task-oriented segmentation (TOS), where the class of each image patch is not predefined but depends on the specific input task. This work introduces Vision Language Guided Token Pruning (VLTP), a novel token pruning mechanism that can accelerate ViT-based segmentation models, particularly for TOS guided by a multi-modal large language model (MLLM). We argue that a ViT does not need to process every image token through all of its layers: only the tokens related to reasoning tasks are necessary. We design a new pruning…
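
The abstract is cut off before it describes the pruning criterion, so the following is only a minimal sketch of the general idea it states: score each image token against the task, then pass only the most task-relevant tokens to later layers. It is not the VLTP method. The cosine-similarity score, the `keep_ratio` parameter, and the names `cosine_similarity` and `prune_tokens` are all assumptions made for illustration.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def prune_tokens(
    tokens: List[List[float]],
    task_embedding: Sequence[float],
    keep_ratio: float,
) -> Tuple[List[int], List[List[float]]]:
    """Keep the fraction of tokens most similar to the task embedding.

    Hypothetical helper: the relevance score here is plain cosine
    similarity, chosen for illustration, not the paper's criterion.
    Returns the kept indices (in original order) and the kept tokens.
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    num_keep = max(1, math.ceil(len(tokens) * keep_ratio))
    scores = [cosine_similarity(t, task_embedding) for t in tokens]
    # Rank token indices from most to least task-relevant.
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    # Restore spatial order so positional structure is preserved downstream.
    kept_indices = sorted(ranked[:num_keep])
    return kept_indices, [tokens[i] for i in kept_indices]


# Toy example: four 2-D "image tokens" and one "task" vector.
tokens = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
task = [1.0, 0.0]
kept_indices, kept_tokens = prune_tokens(tokens, task, keep_ratio=0.5)
print(kept_indices)  # [0, 2]
```

In this toy setting, halving the token count halves the number of tokens every later layer has to process. Real savings depend on where in the network pruning is applied and on how the scores are computed, neither of which this excerpt specifies.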

Code & Models

Repositories

HanningChen/VLTP
PyTorch · Official

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications · Advanced Neural Network Applications

Methods: Pruning