Geometry-Aware CLIP Retrieval via Local Cross-Modal Alignment and Steering

Nirmalendu Prakash, Narmeen Fatimah Oozeer, Xin Su, Phillip Howard, Shaan Shah, Zoe Wanying He, Shuang Wu, Shivam Raval, Roy Ka-Wei Lee, Meenakshi Khosla, and Amir Abdullah

arXiv:2604.16487 · cs.CV · April 22, 2026


TL;DR

This paper proposes a geometry-aware CLIP retrieval method that enhances local neighborhood alignment and control, improving attribute-binding and compositional retrieval without retraining.

Contribution

It introduces neighborhood re-ranking with Hungarian matching and query-conditioned local steering to improve local structure in CLIP retrieval.
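One way to read neighborhood re-ranking with Hungarian matching is sketched below. This is an illustrative sketch under stated assumptions, not the authors' implementation: the listing does not say how neighborhoods are built or how the matching score is combined with pointwise similarity. The function names, the mixing weight `alpha`, and the use of mean matched cosine similarity as the structural score are all hypothetical. The optimal one-to-one matching is computed with SciPy's `linear_sum_assignment`.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def l2_normalize(x):
    """Scale vectors along the last axis to unit length."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def neighborhood_match_score(query_nbrs, cand_nbrs):
    """Mean cosine similarity under the optimal one-to-one matching
    between two neighborhoods (rows are embeddings)."""
    sim = l2_normalize(query_nbrs) @ l2_normalize(cand_nbrs).T
    # Hungarian-style assignment; maximize=True matches on similarity.
    rows, cols = linear_sum_assignment(sim, maximize=True)
    return float(sim[rows, cols].mean())


def rerank(query, query_nbrs, cands, cand_nbrs, alpha=0.5):
    """Blend pointwise similarity with a neighborhood matching score.

    alpha is a hypothetical mixing weight (assumption, not from the paper).
    Returns candidate indices sorted best-first and the combined scores.
    """
    pointwise = l2_normalize(cands) @ l2_normalize(query)
    structural = np.array(
        [neighborhood_match_score(query_nbrs, n) for n in cand_nbrs]
    )
    combined = (1.0 - alpha) * pointwise + alpha * structural
    return np.argsort(-combined), combined
```

In this sketch a candidate whose neighbors line up one-to-one with the query's neighbors can overtake a candidate that scores higher on pointwise similarity alone, which is the sense in which the score rewards structural consistency.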

Findings

01. Improved retrieval performance on attribute-binding tasks.

02. Enhanced control over retrieval results via local neighborhood steering.

03. Demonstrated effectiveness without retraining the model.

Abstract

CLIP retrieval is typically framed as a pointwise similarity problem in a shared embedding space. While CLIP achieves strong global cross-modal alignment, many retrieval failures arise from local geometric inconsistencies: nearby items are incorrectly ordered, which leads to systematic confusions (e.g., pentagon vs. hexagon) and produces diffuse, weakly controlled result sets. Prior work largely optimizes for pointwise relevance or relies on fine-tuning to mitigate these problems. We instead view retrieval as a problem of neighborhood alignment. Our work introduces (1) neighborhood-level re-ranking via Hungarian matching, which rewards structural consistency, and (2) query-conditioned local steering, where directions derived from contrastive neighborhoods around the query reshape retrieval. We show that these techniques improve retrieval performance on attribute-binding and compositional retrieval tasks.…
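Query-conditioned local steering can be illustrated with a minimal sketch. The abstract does not specify how the directions are derived from the contrastive neighborhoods, so the difference-of-means direction used here is an assumption, as are the function names and the `strength` and `top_k` parameters. The sketch shifts the query embedding along a direction pointing from an undesired neighborhood toward a desired one, then retrieves by cosine similarity.

```python
import numpy as np


def l2_normalize(x):
    """Scale vectors along the last axis to unit length."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def local_steering_direction(pos_nbrs, neg_nbrs):
    """Unit direction from the negative neighborhood mean toward the
    positive neighborhood mean (difference of means is an assumption)."""
    diff = l2_normalize(pos_nbrs).mean(axis=0) - l2_normalize(neg_nbrs).mean(axis=0)
    return l2_normalize(diff)


def steered_retrieval(query, gallery, pos_nbrs, neg_nbrs, strength=0.5, top_k=5):
    """Shift the query along the local direction, then rank the gallery.

    strength is a hypothetical step size; strength=0 recovers plain
    cosine-similarity retrieval.
    """
    q = l2_normalize(query)
    direction = local_steering_direction(pos_nbrs, neg_nbrs)
    q_steered = l2_normalize(q + strength * direction)
    scores = l2_normalize(gallery) @ q_steered
    order = np.argsort(-scores)[:top_k]
    return order, scores[order]
```

Because the direction is computed from neighborhoods around the specific query rather than from a global attribute axis, the same `strength` can have a different effect for different queries, which is one plausible interpretation of "query-conditioned" in the abstract.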
