Hyperdimensional Cross-Modal Alignment of Frozen Language and Image Models for Efficient Image Captioning
Abhishek Dalvi, Vasant Honavar

TL;DR
This paper introduces HDFLIM, a novel framework that aligns frozen vision and language models in a shared hyperdimensional space using symbolic operations, enabling efficient image captioning without model fine-tuning.
Contribution
HDFLIM demonstrates that cross-modal alignment can be achieved through symbolic hyperdimensional operations on frozen models, avoiding costly retraining and fine-tuning.
Findings
HDFLIM achieves image captioning performance comparable to that of end-to-end trained methods.
Generated captions are more semantically grounded than zero-shot baselines.
Alignment is accomplished without modifying pretrained models.
Abstract
Large unimodal foundation models for vision and language encode rich semantic structures, yet aligning them typically requires computationally intensive multimodal fine-tuning. Such approaches depend on large-scale parameter updates, are resource intensive, and can perturb pretrained representations. Emerging evidence suggests, however, that independently trained foundation models may already exhibit latent semantic compatibility, reflecting shared structures in the data they model. This raises a fundamental question: can cross-modal alignment be achieved without modifying the models themselves? Here we introduce HDFLIM (HyperDimensional computing with Frozen Language and Image Models), a framework that establishes cross-modal mappings while keeping pretrained vision and language models fully frozen. HDFLIM projects unimodal embeddings into a shared hyperdimensional space and leverages…
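To make the idea of projecting embeddings into a shared hyperdimensional space more concrete, the following is a minimal sketch of generic hyperdimensional computing, not the HDFLIM implementation. The abstract is truncated before it names the operations used, so everything here is an assumption for illustration: the dimensionality `D`, the sign-of-random-projection encoder, the bind (elementwise product) and bundle (elementwise sum) operations, and the toy image embeddings and token names. The sketch stores image-token associations in a single memory vector and recovers the token for a query image by unbinding, with no trained parameters.

```python
import random

D = 2048        # hypervector dimensionality (assumed, not from the paper)
EMB_DIM = 32    # size of the toy "frozen model" embeddings (assumed)
rng = random.Random(0)


def random_projection(in_dim, out_dim):
    """Fixed random Gaussian matrix: out_dim rows of in_dim weights."""
    return [[rng.gauss(0.0, 1.0) for _ in range(in_dim)] for _ in range(out_dim)]


def encode(embedding, proj):
    """Project a dense embedding and keep only the sign (bipolar hypervector)."""
    return [1 if sum(w * x for w, x in zip(row, embedding)) >= 0 else -1
            for row in proj]


def bind(a, b):
    """Elementwise product; its own inverse when b is bipolar."""
    return [x * y for x, y in zip(a, b)]


def bundle(vectors):
    """Elementwise sum, superposing several hypervectors into one."""
    return [sum(col) for col in zip(*vectors)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def retrieve(memory, image_hv, token_hvs):
    """Unbind the memory with the image hypervector, return the closest token."""
    noisy_token = bind(memory, image_hv)
    return max(token_hvs, key=lambda name: dot(noisy_token, token_hvs[name]))


# Stand-ins for frozen-model outputs: random image embeddings and random
# bipolar token hypervectors (hypothetical data, for illustration only).
proj = random_projection(EMB_DIM, D)
tokens = ["dog", "car", "tree"]
image_embeddings = {t: [rng.gauss(0.0, 1.0) for _ in range(EMB_DIM)] for t in tokens}
token_hvs = {t: [rng.choice((-1, 1)) for _ in range(D)] for t in tokens}

# Build one associative memory from all (image, token) pairs; nothing is trained.
image_hvs = {t: encode(image_embeddings[t], proj) for t in tokens}
memory = bundle([bind(image_hvs[t], token_hvs[t]) for t in tokens])

recovered = [retrieve(memory, image_hvs[t], token_hvs) for t in tokens]
print(recovered)
```

The design point the sketch illustrates is that the only "learning" is an accumulation of bound pairs into `memory`; the encoders that produce the embeddings are never updated, which mirrors the frozen-model setting the abstract describes.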
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Ferroelectric and Negative Capacitance Devices
