Hyperdimensional Cross-Modal Alignment of Frozen Language and Image Models for Efficient Image Captioning

Abhishek Dalvi; Vasant Honavar

arXiv:2602.23588 · cs.CV · March 2, 2026

TL;DR

This paper introduces HDFLIM, a novel framework that aligns frozen vision and language models in a shared hyperdimensional space using symbolic operations, enabling efficient image captioning without model fine-tuning.

Contribution

HDFLIM demonstrates that cross-modal alignment can be achieved through symbolic hyperdimensional operations on frozen models, avoiding costly retraining and fine-tuning.

Findings

01 HDFLIM achieves performance comparable to end-to-end training methods.

02 Generated captions are more semantically grounded than those of zero-shot baselines.

03 Alignment is accomplished without modifying the pretrained models.

Abstract

Large unimodal foundation models for vision and language encode rich semantic structures, yet aligning them typically requires computationally intensive multimodal fine-tuning. Such approaches depend on large-scale parameter updates, are resource intensive, and can perturb pretrained representations. Emerging evidence suggests, however, that independently trained foundation models may already exhibit latent semantic compatibility, reflecting shared structures in the data they model. This raises a fundamental question: can cross-modal alignment be achieved without modifying the models themselves? Here we introduce HDFLIM (HyperDimensional computing with Frozen Language and Image Models), a framework that establishes cross-modal mappings while keeping pretrained vision and language models fully frozen. HDFLIM projects unimodal embeddings into a shared hyperdimensional space and leverages…
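
The abstract is truncated here, so the exact operations HDFLIM uses are not shown. As a rough illustration of what "projecting unimodal embeddings into a shared hyperdimensional space" and combining them with symbolic operations can look like, the following is a minimal sketch of generic hyperdimensional computing primitives (random projection, binding, bundling). It is not the authors' implementation: the dimensionality `D`, the embedding sizes, the random stand-in embeddings, and all function names are assumptions made for this example.

```python
import numpy as np

D = 4096  # hypervector dimensionality (assumed; not taken from the paper)


def make_projection(in_dim, rng, dim=D):
    """Fixed random Gaussian matrix mapping an embedding to `dim` dimensions."""
    return rng.standard_normal((dim, in_dim))


def encode(x, proj):
    """Project an embedding and binarize it to a bipolar {-1, +1} hypervector."""
    return np.where(proj @ x >= 0, 1, -1).astype(np.int8)


def bind(a, b):
    """Binding: elementwise product. It is its own inverse for bipolar vectors."""
    return a * b


def bundle(hvs):
    """Bundling: elementwise majority vote over hypervectors (ties go to +1)."""
    total = np.sum(np.stack(hvs).astype(np.int32), axis=0)
    return np.where(total >= 0, 1, -1).astype(np.int8)


def similarity(a, b):
    """Normalized dot product in [-1, 1]; near 0 for unrelated hypervectors."""
    return float(a.astype(np.int32) @ b.astype(np.int32)) / a.size


rng = np.random.default_rng(0)

# Random stand-ins for frozen-encoder outputs (hypothetical sizes 512 and 768).
img_embs = rng.standard_normal((3, 512))
txt_embs = rng.standard_normal((3, 768))

# Each modality gets its own fixed projection into the shared space.
img_proj = make_projection(512, rng)
txt_proj = make_projection(768, rng)
img_hvs = [encode(e, img_proj) for e in img_embs]
txt_hvs = [encode(e, txt_proj) for e in txt_embs]

# Store image-text associations in a single memory hypervector.
memory = bundle([bind(i, t) for i, t in zip(img_hvs, txt_hvs)])

# Unbinding with an image hypervector yields a noisy copy of its paired text.
recovered = bind(memory, img_hvs[0])
print("paired text:   ", similarity(recovered, txt_hvs[0]))
print("unrelated text:", similarity(recovered, txt_hvs[1]))
```

In this sketch nothing is trained and no encoder weights are touched: the projections are fixed random matrices, and the association is stored and queried purely with elementwise operations. The recovered vector should score clearly higher against the paired text hypervector than against an unrelated one.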

Code & Models

Models

adalvi/qwen2vl-lora-coco (Hugging Face model, 22 downloads)

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Ferroelectric and Negative Capacitance Devices