DRIP: Dynamic patch Reduction via Interpretable Pooling
Yusen Peng, Sachin Kumar

TL;DR
DRIP adapts to each input image and dynamically merges visual tokens in the deeper layers of a visual encoder, reducing GFLOPs while keeping classification and zero-shot performance comparable.
Contribution
This work introduces DRIP, an interpretable pooling technique that adaptively merges tokens in visual encoders, improving efficiency while keeping accuracy comparable.
Findings
Significant GFLOP reduction in both ImageNet training from scratch and CLIP contrastive pretraining.
Comparable classification and zero-shot performance is maintained.
Continual pretraining on a large biology dataset extends the method to scientific domains.
Abstract
Recent advances in vision-language models, including contrastive pretraining and instruction tuning, have greatly pushed the frontier of multimodal AI. However, because large-scale pretraining is expensive, efficiency concerns have discouraged researchers from attempting to pretrain a vision-language model from scratch. In this work, we propose Dynamic patch Reduction via Interpretable Pooling (DRIP), which adapts to the input images and dynamically merges tokens in the deeper layers of a visual encoder. Our results on both ImageNet training from scratch and CLIP contrastive pretraining demonstrate a significant GFLOP reduction while maintaining comparable classification and zero-shot performance. To further validate our proposed method, we conduct continual pretraining on a large biology dataset, extending its impact to scientific domains.
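The abstract does not spell out how tokens are merged, so the following is only a generic sketch of input-dependent token pooling, not DRIP's actual algorithm. It assumes a simple rule chosen purely for illustration: consecutive tokens are mean-pooled into one token whenever their cosine similarity stays at or above a threshold. The function name `merge_adjacent_tokens` and the parameter `threshold` are hypothetical.

```python
import numpy as np


def merge_adjacent_tokens(tokens, threshold=0.9):
    """Mean-pool runs of consecutive tokens that are similar to each other.

    Illustrative assumption, not the paper's method: a new segment starts
    wherever the cosine similarity between neighbouring tokens drops
    below `threshold`.
    """
    tokens = np.asarray(tokens, dtype=float)  # shape (N, D)
    # Unit-normalise so that row-wise dot products are cosine similarities.
    norms = np.linalg.norm(tokens, axis=1, keepdims=True)
    unit = tokens / np.clip(norms, 1e-12, None)
    sim = np.sum(unit[:-1] * unit[1:], axis=1)  # shape (N - 1,)
    # Mark a boundary between dissimilar neighbours, then number the segments.
    boundaries = sim < threshold
    segment_ids = np.concatenate([[0], np.cumsum(boundaries)])
    num_segments = int(segment_ids[-1]) + 1
    # Sum the tokens of each segment, then divide by the segment size.
    sums = np.zeros((num_segments, tokens.shape[1]))
    np.add.at(sums, segment_ids, tokens)
    counts = np.bincount(segment_ids, minlength=num_segments)[:, None]
    return sums / counts, segment_ids


if __name__ == "__main__":
    patches = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [0.0, 1.0]])
    merged, ids = merge_adjacent_tokens(patches)
    print(merged.shape, ids)
```

Because the number of segments depends on the content of the input, images with large uniform regions yield fewer tokens than cluttered ones, so later layers process a shorter sequence. The returned segment ids record which original patches were pooled into each merged token, which is one simple way such a scheme can be inspected.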
