SPANER: Shared Prompt Aligner for Multimodal Semantic Representation

Thye Shan Ng; Caren Soyeon Han; Eun-Jung Holden

arXiv:2508.13387·cs.AI·August 20, 2025

TL;DR

SPANER introduces a unified, modality-agnostic prompt framework that aligns diverse multimodal inputs into a shared semantic space, improving cross-modal generalisation and retrieval performance.

Contribution

The paper presents SPANER, a novel Parameter-Efficient Fine-Tuning (PEFT) approach that uses shared prompts to unify multimodal embeddings and supports additional modalities without architectural changes.

Findings

01 Achieves competitive few-shot retrieval results.

02 Maintains high semantic coherence across modalities.

03 Supports seamless addition of new modalities.

Abstract

Recent advances in multimodal Parameter-Efficient Fine-Tuning (PEFT) have significantly improved performance on downstream tasks such as few-shot retrieval. However, most existing approaches focus on task-specific gains while neglecting the structure of the multimodal embedding space. As a result, modality-specific representations often remain isolated, limiting cross-modal generalisation. In this work, we introduce Shared Prompt AligNER (SPANER), a modality-agnostic PEFT framework designed to embed inputs from diverse modalities into a unified semantic space. At its core, SPANER employs a shared prompt mechanism that acts as a conceptual anchor, enabling semantically related instances to converge spatially regardless of modality. This shared prompt design is inherently extensible, supporting the seamless integration of additional modalities, such as audio, without altering the core…
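
The shared-prompt idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the shared prompt is a small set of vectors prepended to each modality's token sequence before a frozen encoder. The names `make_vectors`, `prepend_shared_prompt`, and `toy_encoder`, the dimensions, and the mean-pooling stand-in for the encoder are all hypothetical and chosen only for illustration.

```python
import math
import random


def make_vectors(n, dim, rng):
    """Return n random vectors of length dim (stand-ins for learned
    prompt parameters or token embeddings)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n)]


def prepend_shared_prompt(shared_prompt, tokens):
    """Prepend the same prompt vectors to a token sequence from any modality."""
    return [list(v) for v in shared_prompt] + [list(v) for v in tokens]


def toy_encoder(sequence):
    """Hypothetical frozen encoder: mean-pool the sequence, then L2-normalise.
    A real system would use a pretrained transformer here (assumption)."""
    dim = len(sequence[0])
    pooled = [sum(vec[i] for vec in sequence) / len(sequence) for i in range(dim)]
    norm = math.sqrt(sum(x * x for x in pooled))
    return [x / norm for x in pooled]


def cosine(a, b):
    """Cosine similarity of two unit-length vectors (a plain dot product)."""
    return sum(x * y for x, y in zip(a, b))


rng = random.Random(0)
DIM, PROMPT_LEN = 8, 4

# One prompt, reused by every modality. In a PEFT setting these vectors
# would be the trainable parameters; here they are fixed random values.
shared_prompt = make_vectors(PROMPT_LEN, DIM, rng)

image_tokens = make_vectors(6, DIM, rng)  # e.g. patch embeddings
text_tokens = make_vectors(5, DIM, rng)   # e.g. word-piece embeddings
audio_tokens = make_vectors(7, DIM, rng)  # an added modality reuses the same prompt

image_seq = prepend_shared_prompt(shared_prompt, image_tokens)
text_seq = prepend_shared_prompt(shared_prompt, text_tokens)
audio_seq = prepend_shared_prompt(shared_prompt, audio_tokens)

image_emb = toy_encoder(image_seq)
text_emb = toy_encoder(text_seq)
audio_emb = toy_encoder(audio_seq)

print("image vs text :", cosine(image_emb, text_emb))
print("image vs audio:", cosine(image_emb, audio_emb))
```

The sketch only shows the mechanics: every modality's sequence begins with the same prompt vectors, and adding audio requires no new prompt and no change to the other code paths. It makes no claim about the alignment quality the paper reports, which depends on training the prompt against real encoders.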
