GraSP-VL: Length as a Semantic Granularity Interface for Vision-Language Representations
Zesheng Li, Chengchang Pan, Honggang Qi

TL;DR
GraSP-VL turns fixed-length vision-language embeddings into a controllable semantic interface using a shared prefix transform, enabling multi-resolution semantic access without retraining the underlying model.
Contribution
The paper proposes GraSP-VL, a shared near-orthogonal prefix transform that reorganizes frozen embeddings into a hierarchical semantic interface with controllable granularity.
Findings
Achieves high staircase score and hard-negative selectivity on COCO/Flickr30K.
Transfers effectively to SugarCrepe-clean with high object accuracy.
Maintains full-dimensional zero-shot CIFAR-100 accuracy.
Abstract
Frozen vision-language embeddings contain signals at multiple semantic resolutions, from object identity to attributes, relations, and full-caption meaning, but they expose these signals through a fixed-length vector interface. We study whether embedding length can be turned into a controllable semantic access interface. We propose GraSP-VL, which learns a shared near-orthogonal prefix transform over frozen VLM embeddings. GraSP-VL instantiates a Semantic Matryoshka interface: short prefixes are assigned coarse semantic roles, while longer prefixes progressively expose finer language-grounded distinctions. Because the transform is shared across image and text embeddings and preserves full-dimensional geometry, prefix behavior changes without rewriting the original VLM space. On a 20,147-example COCO/Flickr30K annotation pool, GraSP-VL reaches a staircase score of 53.01…
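The geometric claim in the abstract (a shared transform that preserves full-dimensional similarity while changing what a prefix exposes) can be illustrated with a minimal sketch. This is not the authors' implementation: the transform here is a random exactly orthogonal matrix rather than a learned near-orthogonal one, the embeddings are random stand-ins for frozen VLM outputs, and the helper names `random_orthogonal` and `prefix_similarity` are hypothetical.

```python
import numpy as np


def random_orthogonal(dim, seed=0):
    """Return a random orthogonal matrix (stand-in for a learned transform)."""
    rng = np.random.default_rng(seed)
    q, r = np.linalg.qr(rng.standard_normal((dim, dim)))
    # Fix column signs so the factorization is unique; q stays orthogonal.
    return q * np.sign(np.diag(r))


def prefix_similarity(img, txt, transform, k):
    """Cosine similarity using only the first k transformed dimensions."""
    a = (img @ transform)[:, :k]
    b = (txt @ transform)[:, :k]
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T


# Random placeholders for frozen image and text embeddings (assumption).
rng = np.random.default_rng(1)
dim = 64
img = rng.standard_normal((4, dim))
txt = rng.standard_normal((5, dim))

# The same matrix is applied to both modalities.
W = random_orthogonal(dim)

# Full-dimensional inner products are unchanged by an orthogonal transform.
full_before = img @ txt.T
full_after = (img @ W) @ (txt @ W).T
geometry_preserved = np.allclose(full_before, full_after)

# A short prefix gives a different, coarser similarity matrix.
sims_16 = prefix_similarity(img, txt, W, 16)
```

With an exactly orthogonal matrix, all inner products at full length are preserved, so only prefix-level behavior depends on the choice of transform. A learned near-orthogonal transform would preserve them only approximately.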
