PERL: Parameter Efficient Reasoning in CLIP Latent Space

Simone Carnemolla; Salvatore Calcagno; Daniela Giordano; Concetto Spampinato; Matteo Pennisi

arXiv:2605.18464·cs.CV·May 20, 2026

TL;DR

PERL introduces a lightweight, iterative latent reasoning framework that enhances CLIP's adaptation to downstream tasks with minimal trainable parameters, maintaining open-vocabulary capabilities.

Contribution

The paper proposes a latent reasoning approach for CLIP that enables effective adaptation without significantly increasing the parameter count.

Findings

01

PERL achieves state-of-the-art parameter-performance trade-off across 15 benchmarks.

02

It maintains strong zero-shot and out-of-distribution performance.

03

It uses only about 6K trainable parameters, far fewer than existing methods.

Abstract

Contrastively trained vision-language models such as CLIP provide strong zero-shot transfer by aligning images and text in a shared embedding space. However, adapting these models to downstream tasks without degrading their open-vocabulary generalization remains challenging. Existing parameter-efficient adaptation methods typically improve task specialization through learned prompts, adapters, or multimodal transformations, where adaptation capacity is primarily expressed through additional trainable parameters. Inspired by recent latent reasoning methods in language models, we investigate a complementary perspective: can adaptation emerge from iterative reasoning on latent representations rather than from increasing parameter count alone? We introduce PERL (Parameter-Efficient Reasoning in CLIP Latent Space), a lightweight adaptation framework that augments a frozen CLIP model with a…
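To make the idea of "iterative reasoning on latent representations" concrete, here is a minimal NumPy sketch. It is an illustration only and not the paper's architecture, which the truncated abstract does not specify. The class `LatentRefiner`, the low-rank residual form, the `tanh` nonlinearity, the embedding width of 512, the logit scale of 100, and the rank of 6 are all assumptions; the rank was picked only so that the parameter count lands near the "about 6K" figure quoted above. The sketch shows two things: zero-shot scoring as scaled cosine similarity between image and text embeddings, and how reusing one small update across several steps adds iterations without adding parameters.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to (approximately) unit length along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def zero_shot_logits(image_emb, text_emb, scale=100.0):
    """CLIP-style scoring: scaled cosine similarity (scale is an assumed value)."""
    return scale * l2_normalize(image_emb) @ l2_normalize(text_emb).T

class LatentRefiner:
    """Hypothetical low-rank residual update, reused at every refinement step.

    NOT the PERL module: a stand-in showing that iteration count and
    parameter count are independent when weights are shared across steps.
    """

    def __init__(self, dim=512, rank=6, seed=0):
        rng = np.random.default_rng(seed)
        self.down = rng.normal(0.0, 0.02, size=(dim, rank))
        # Zero init makes the untrained refiner an identity map on the
        # normalized embedding, so zero-shot behaviour is preserved at start.
        self.up = np.zeros((rank, dim))

    def num_parameters(self):
        return self.down.size + self.up.size

    def __call__(self, z, steps=3):
        for _ in range(steps):
            # Same two matrices at every step: more steps, same parameters.
            z = l2_normalize(z + np.tanh(z @ self.down) @ self.up)
        return z

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    image_emb = rng.normal(size=(4, 512))   # stand-ins for frozen CLIP features
    text_emb = rng.normal(size=(10, 512))   # one row per class prompt
    refiner = LatentRefiner()
    refined = refiner(image_emb, steps=5)
    print(refiner.num_parameters())
    print(zero_shot_logits(refined, text_emb).argmax(axis=1))
```

Under these assumptions the trainable budget is 2 × 512 × 6 = 6144 values regardless of how many refinement steps are run, and an untrained refiner leaves the zero-shot logits unchanged.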
