FrEVL: Leveraging Frozen Pretrained Embeddings for Efficient Vision-Language Understanding
Emmanuelle Bourigault, Pauline Bourigault

TL;DR
FrEVL demonstrates that using frozen pretrained embeddings can achieve near state-of-the-art vision-language understanding performance with significantly reduced computational cost, making it suitable for resource-constrained scenarios.
Contribution
The paper introduces FrEVL, a novel framework that leverages frozen pretrained embeddings for efficient vision-language tasks, reducing training complexity and energy consumption.
Findings
Achieves 85% to 95% of state-of-the-art performance on standard benchmarks with only 68.4M trainable parameters.
Provides a 2.3x speedup and 52% lower energy consumption when end-to-end computation, including embedding extraction, is accounted for.
Effectiveness depends on alignment between pretraining objectives and downstream tasks.
Abstract
The deployment of vision-language models remains constrained by substantial computational requirements. We present FrEVL, a framework exploring whether frozen pretrained embeddings can support effective vision-language understanding. Our analysis reveals that frozen embeddings contain rich information for discriminative tasks, achieving 85% to 95% of state-of-the-art performance on standard benchmarks with only 68.4M trainable parameters. This performance dichotomy reveals a critical insight: frozen embedding effectiveness depends on alignment between pretraining objectives and downstream task requirements. When accounting for end-to-end computation including embedding extraction, FrEVL provides a 2.3x speedup with 52% lower energy consumption, making it suitable for scenarios with pre-computable inputs or when deployment constraints outweigh marginal performance gains.…
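The frozen-embedding idea in the abstract can be illustrated with a toy sketch: run fixed encoders once, cache their outputs, and train only a small head on the cache. Everything below is an assumption made for illustration and is not taken from the paper: the encoders are fixed random linear maps standing in for pretrained models, the data and labelling rule are synthetic, and a logistic-regression head is swapped in for FrEVL's trainable module, whose architecture this summary does not describe.

```python
import math
import random

rng = random.Random(0)
RAW_DIM, EMB_DIM = 6, 4  # toy sizes; real encoders are far wider


def make_frozen_encoder():
    """Hypothetical stand-in for a pretrained encoder: a fixed linear map."""
    # Weights live in immutable tuples, so training cannot touch them.
    weights = tuple(
        tuple(rng.gauss(0, 1) for _ in range(RAW_DIM)) for _ in range(EMB_DIM)
    )

    def encode(x):
        raw = [sum(w * v for w, v in zip(row, x)) for row in weights]
        norm = math.sqrt(sum(r * r for r in raw)) or 1.0
        return [r / norm for r in raw]  # L2-normalised embedding

    return encode


encode_image = make_frozen_encoder()
encode_text = make_frozen_encoder()

# Synthetic (image, text) inputs with an arbitrary binary labelling rule.
pairs = [
    ([rng.gauss(0, 1) for _ in range(RAW_DIM)],
     [rng.gauss(0, 1) for _ in range(RAW_DIM)])
    for _ in range(64)
]
labels = [1.0 if image[0] > 0 else 0.0 for image, _ in pairs]

# Step 1: run the frozen encoders once and cache the concatenated embeddings.
cache = [encode_image(image) + encode_text(text) for image, text in pairs]


def logit(params, feats):
    # params holds one weight per feature, followed by a bias term.
    return sum(w * f for w, f in zip(params, feats)) + params[-1]


def mean_loss(params):
    total = 0.0
    for feats, y in zip(cache, labels):
        z = logit(params, feats)
        total += math.log(1.0 + math.exp(-z)) if y == 1.0 else math.log(1.0 + math.exp(z))
    return total / len(cache)


def train_head(steps=200, lr=0.5):
    """Step 2: full-batch gradient descent on the head parameters only."""
    params = [0.0] * (2 * EMB_DIM + 1)
    losses = [mean_loss(params)]
    for _ in range(steps):
        grad = [0.0] * len(params)
        for feats, y in zip(cache, labels):
            err = 1.0 / (1.0 + math.exp(-logit(params, feats))) - y
            for i, f in enumerate(feats):
                grad[i] += err * f
            grad[-1] += err
        params = [p - lr * g / len(cache) for p, g in zip(params, grad)]
        losses.append(mean_loss(params))
    return params, losses


head, losses = train_head()
print(f"trainable parameters: {len(head)}")
print(f"mean loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

Because the encoders are never updated, the cache is computed once and reused across every training step. This is the setting the abstract calls pre-computable inputs, where the cost of embedding extraction does not recur during training.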
