T-GEMs: Text-Guided Exit Modules for Decreasing CLIP Image Encoder
Alberto Presta, Grzegorz Stefanski, Michal Byra, Krzysztof Arendt

TL;DR
This paper introduces Text-Guided Exit Modules (T-GEMs) and a rate-based regularizer, which together reduce the computational cost of image-text encoders such as CLIP by letting inputs exit at intermediate encoder layers.
Contribution
The paper proposes T-GEMs and a rate-based regularizer to control encoder usage costs in multimodal models while maintaining cross-modal understanding performance.
Findings
T-GEMs effectively reduce encoder computation during inference.
The rate-based regularizer maintains cross-modal understanding while decreasing resource use.
The approach adapts exit points based on semantic content distributions.
Abstract
Multimodal deep neural networks enhance deep comprehension by integrating diverse data modalities. Data from different modalities are typically projected into a shared latent space for similarity computation, but this process is resource intensive due to large image encoders and equal processing of test data during prediction. Early exit methods reduce computational load by utilizing intermediate layers, saving time and memory. However, developing such methods is challenging for multimodal data like image-text pairs. This study investigates the semantic content distributions present in intermediate layers of encoders such as CLIP, which can be derived from textual descriptions. We introduce Text-Guided Exit Modules (T-GEMs) and a rate-based regularizer to control encoder usage costs while maintaining cross-modal understanding performance.
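The general early-exit idea mentioned in the abstract can be illustrated with a small sketch. This is not the authors' T-GEM implementation: the exit rule (a softmax over cosine similarities to text embeddings, compared against a threshold), the function name `early_exit_predict`, the `threshold` and `temperature` parameters, and the toy layers are all assumptions chosen for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def softmax(xs):
    """Numerically stable softmax."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def early_exit_predict(x, layers, text_embeddings, threshold=0.9, temperature=0.1):
    """Apply layers one at a time and stop once the best text match is confident.

    Hypothetical exit rule: after each layer, compare the intermediate
    representation with every text embedding and exit when the largest
    softmax probability reaches `threshold`.
    Returns (index of best-matching text, number of layers used).
    """
    if not layers:
        raise ValueError("at least one layer is required")
    probs, used = [], 0
    for used, layer in enumerate(layers, start=1):
        x = layer(x)
        sims = [cosine(x, t) / temperature for t in text_embeddings]
        probs = softmax(sims)
        if max(probs) >= threshold:
            break  # confident enough: skip the remaining layers
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, used

# Toy setup (assumed): each "layer" nudges the feature toward the first text embedding.
toy_layers = [lambda v: [v[0] + 1.0, v[1]] for _ in range(4)]
toy_texts = [[1.0, 0.0], [0.0, 1.0]]

early = early_exit_predict([0.0, 1.0], toy_layers, toy_texts)
full = early_exit_predict([0.0, 1.0], toy_layers, toy_texts, threshold=1.0)
print(early, full)
```

In this toy run the default threshold lets the input leave after two of the four layers, while a threshold of 1.0 forces the full stack to run. A lower threshold trades accuracy for compute, which is the kind of trade-off the paper's rate-based regularizer is described as controlling.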
