GraSP-VL: Length as a Semantic Granularity Interface for Vision-Language Representations

Zesheng Li, Chengchang Pan, Honggang Qi

arXiv:2605.17727 · cs.CV · May 19, 2026

TL;DR

GraSP-VL learns a shared prefix transform over frozen vision-language embeddings so that embedding length acts as a controllable semantic interface: shorter prefixes give coarser semantics and longer prefixes give finer ones, without retraining the underlying VLM.

Contribution

The paper proposes GraSP-VL, a shared near-orthogonal prefix transform that reorganizes frozen VLM embeddings into a hierarchical semantic interface whose granularity is controlled by prefix length.

Findings

01. Achieves a high staircase score and high hard-negative selectivity on a COCO/Flickr30K annotation pool.

02. Transfers effectively to SugarCrepe-clean, with high object accuracy.

03. Maintains the full-dimensional zero-shot CIFAR-100 accuracy of the original embeddings.

Abstract

Frozen vision-language embeddings contain signals at multiple semantic resolutions, from object identity to attributes, relations, and full-caption meaning, but they expose these signals through a fixed-length vector interface. We study whether embedding length can be turned into a controllable semantic access interface. We propose GraSP-VL, which learns a shared near-orthogonal prefix transform over frozen VLM embeddings. GraSP-VL instantiates a Semantic Matryoshka interface: short prefixes are assigned coarse semantic roles, while longer prefixes progressively expose finer language-grounded distinctions. Because the transform is shared across image and text embeddings and preserves full-dimensional geometry, prefix behavior changes without rewriting the original VLM space. On a 20,147-example COCO/Flickr30K annotation pool, GraSP-VL reaches a staircase score of 53.01…
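The prefix mechanism described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the learned transform is replaced here by a random orthogonal matrix, the embeddings are random stand-ins, and every name (`random_orthogonal`, `transform`, `prefix_similarity`) is hypothetical. It only shows two properties the abstract states: one transform is shared by image and text embeddings, and an orthogonal transform leaves full-dimensional similarity unchanged while prefix similarity depends on the prefix length k.

```python
import math
import random


def random_orthogonal(d, seed=0):
    """Build a d x d orthogonal matrix (list of rows) by Gram-Schmidt.

    Assumption: a random matrix stands in for the learned transform.
    """
    rng = random.Random(seed)
    rows = []
    while len(rows) < d:
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        # Remove the components along the rows accepted so far.
        for q in rows:
            proj = sum(a * b for a, b in zip(v, q))
            v = [a - proj * b for a, b in zip(v, q)]
        norm = math.sqrt(sum(a * a for a in v))
        if norm > 1e-8:  # skip degenerate draws
            rows.append([a / norm for a in v])
    return rows


def transform(Q, x):
    """Apply the shared transform: z = Q x."""
    return [sum(q * xi for q, xi in zip(row, x)) for row in Q]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)


def prefix_similarity(Q, img, txt, k):
    """Cosine similarity using only the first k transformed coordinates."""
    return cosine(transform(Q, img)[:k], transform(Q, txt)[:k])


d = 16
rng = random.Random(1)
img = [rng.gauss(0.0, 1.0) for _ in range(d)]  # stand-in image embedding
txt = [rng.gauss(0.0, 1.0) for _ in range(d)]  # stand-in text embedding
Q = random_orthogonal(d)  # the same Q is applied to both modalities

full_before = cosine(img, txt)                  # similarity in the original space
full_after = prefix_similarity(Q, img, txt, d)  # full-length prefix after Q
prefix_scores = {k: prefix_similarity(Q, img, txt, k) for k in (2, 4, 8, 16)}
```

With a random matrix the short prefixes carry no particular meaning; in the paper the transform is learned so that short prefixes take coarse semantic roles and longer ones expose finer distinctions. The sketch only checks the geometric part: `full_after` matches `full_before` up to floating-point error, while the entries of `prefix_scores` vary with k.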
