FREE: Fast and Robust Vision Language Models with Early Exits
Divya Jyoti Bajpai, Manjesh Kumar Hanawal

TL;DR
This paper introduces FREE, an adversarial training method with early exit strategies for vision-language models, significantly speeding up inference while maintaining accuracy and robustness.
Contribution
We propose a novel adversarial training framework for early exit strategies in VLMs, improving inference speed and robustness with minimal performance loss.
Findings
Speeds up inference by over 1.51x
Enhances model robustness against overthinking
Maintains comparable accuracy with faster inference
Abstract
In recent years, Vision-Language Models (VLMs) have shown remarkable performance improvements in Vision-Language tasks. However, their large size poses challenges for real-world applications where inference latency is a concern. To tackle this issue, we propose employing Early Exit (EE) strategies in VLMs. However, training exit classifiers in VLMs is challenging, particularly with limited labeled training data. To address this, we introduce FREE, an adversarial training approach within a GAN-based framework. Here, each exit consists of a transformer layer and a classifier. The transformer layer is adversarially trained to produce feature representations similar to the final layer, while a feature classifier serves as the discriminator. Our method focuses on performing input-adaptive inference that increases inference speed with minimal drop in performance. Experimental results…
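The input-adaptive inference described in the abstract can be illustrated with a minimal sketch. The abstract does not say how FREE decides when to exit, so the rule used here (stop as soon as the top softmax probability reaches a threshold) is an assumption borrowed from common early-exit practice, not the authors' stated criterion. The names early_exit_predict, layers, exit_heads, and threshold are hypothetical, and the adversarial (GAN-based) training of the exits is not shown.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def early_exit_predict(
    hidden: Vector,
    layers: Sequence[Callable[[Vector], Vector]],
    exit_heads: Sequence[Callable[[Vector], Sequence[float]]],
    threshold: float = 0.9,
) -> Tuple[int, int]:
    """Return (predicted_class, layers_used).

    Assumption: an exit fires when its top softmax probability is at
    least `threshold`. The last layer always produces a prediction.
    """
    if not layers or len(layers) != len(exit_heads):
        raise ValueError("need one exit head per layer, and at least one layer")

    for depth, (layer, head) in enumerate(zip(layers, exit_heads), start=1):
        hidden = layer(hidden)            # run one more backbone layer
        probs = softmax(head(hidden))     # score the sample at this exit
        confidence = max(probs)
        if confidence >= threshold or depth == len(layers):
            return probs.index(confidence), depth

    raise AssertionError("unreachable: the final layer always returns")


if __name__ == "__main__":
    # Toy stand-ins: each "layer" doubles the features, each "head"
    # uses the features directly as logits.
    double = lambda v: [2.0 * x for x in v]
    identity = lambda v: v
    label, used = early_exit_predict([0.5, 0.0], [double] * 4, [identity] * 4)
    print(f"class={label}, layers used={used} of 4")
```

In this toy setup, easy inputs leave after fewer layers and hard inputs run the full stack, which is where the latency saving comes from. Lowering the threshold trades accuracy for speed.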
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications
