DRIP: Dynamic patch Reduction via Interpretable Pooling

Yusen Peng; Sachin Kumar

arXiv:2510.25067·cs.CV·November 5, 2025

TL;DR

DRIP dynamically merges visual tokens in the deeper layers of a visual encoder, reducing GFLOPs while keeping classification and zero-shot performance comparable in both ImageNet training from scratch and CLIP contrastive pretraining.

Contribution

This work introduces DRIP, an interpretable pooling technique that adapts to each input image and merges tokens in the deeper layers of a visual encoder, improving efficiency while keeping accuracy comparable.

Findings

01. Significant GFLOP reduction in ImageNet training from scratch and in CLIP contrastive pretraining.

02. Classification and zero-shot performance remain comparable.

03. Continual pretraining on a large biology dataset extends the method to scientific domains.

Abstract

Recent advances in vision-language models, including contrastive pretraining and instruction tuning, have greatly pushed the frontier of multimodal AI. However, because large-scale pretraining is expensive, efficiency concerns have discouraged researchers from pretraining a vision-language model from scratch. In this work, we propose Dynamic patch Reduction via Interpretable Pooling (DRIP), which adapts to the input images and dynamically merges tokens in the deeper layers of a visual encoder. Our results on both ImageNet training from scratch and CLIP contrastive pretraining demonstrate a significant GFLOP reduction while maintaining comparable classification and zero-shot performance. To further validate the proposed method, we conduct continual pretraining on a large biology dataset, extending its impact to scientific domains.
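To make the idea of input-dependent token merging concrete, here is a minimal sketch in plain Python. It is not the authors' method: the abstract does not say how DRIP decides which tokens to merge, so the cosine-similarity threshold, the mean pooling, and the names `merge_adjacent_tokens` and `layer_macs` are all assumptions chosen for illustration. The cost estimate uses the common approximation for a standard transformer layer with a 4x MLP expansion, which may differ from the encoder used in the paper.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def merge_adjacent_tokens(tokens, threshold=0.9):
    """Greedily mean-pool runs of neighbouring tokens that look alike.

    Hypothetical merging rule (an assumption, not taken from the paper):
    a token joins the current group when its cosine similarity to the
    group's running mean is at least `threshold`.

    Returns (merged_tokens, groups), where groups[i] lists the input
    indices pooled into merged_tokens[i]. The groups make the reduction
    inspectable: they show which patches were merged.
    """
    if not tokens:
        return [], []
    groups = [[0]]
    for i in range(1, len(tokens)):
        current_mean = mean_vector([tokens[j] for j in groups[-1]])
        if cosine(current_mean, tokens[i]) >= threshold:
            groups[-1].append(i)   # similar enough: extend the current group
        else:
            groups.append([i])     # dissimilar: start a new group
    merged = [mean_vector([tokens[j] for j in g]) for g in groups]
    return merged, groups


def layer_macs(num_tokens, dim):
    """Rough multiply-accumulate count for one standard transformer layer.

    Assumes Q/K/V/output projections (4*N*d^2), attention scores and
    weighted sum (2*N^2*d), and an MLP with 4x expansion (8*N*d^2).
    Softmax, normalization, and biases are ignored.
    """
    return 12 * num_tokens * dim ** 2 + 2 * num_tokens ** 2 * dim


# Four toy 2-D "patch tokens": two near-duplicates followed by two duplicates.
tokens = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [0.0, 1.0]]
merged, groups = merge_adjacent_tokens(tokens, threshold=0.9)
print(groups)  # [[0, 1], [2, 3]]
print(layer_macs(len(merged), 2) / layer_macs(len(tokens), 2))  # 0.4375
```

In this toy case, halving the token count cuts the estimated per-layer cost to well under half, because the attention term grows quadratically with the number of tokens. Every layer that runs after the merge sees the shorter sequence, so the place in the encoder where merging happens determines how much of the total cost is affected.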
