TL;DR
The paper introduces mEOL, a training-free, instruction-guided multimodal embedding framework that maps text, raster images, and SVG code into a shared embedding space, improving vector graphic and image retrieval.
Contribution
It presents a training-free method that uses an MLLM, modality-specific instructions, and SVG structural cues to produce multimodal embeddings, and it outperforms encoder-based baselines in retrieval tasks.
Findings
mEOL outperforms encoder-based baselines in SVG and image retrieval.
The method enables prompt-level control over embeddings without training.
A new text-to-SVG retrieval benchmark is used to demonstrate the method's effectiveness.
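The retrieval setting in these findings can be illustrated with a minimal sketch. It assumes each input has already been turned into an embedding vector by some model; the prompt templates, file names, and toy vectors below are hypothetical placeholders rather than the paper's actual prompts or data, and plain cosine similarity stands in for whatever scoring the authors use.

```python
import math

# Hypothetical instruction templates (NOT taken from the paper): each asks the
# model to compress one modality into a single word, so the instruction itself
# steers what the resulting embedding captures.
TEMPLATES = {
    "text": 'This sentence: "{content}" means in one word:',
    "image": "This image: {content} means in one word:",
    "svg": "This SVG code: {content} means in one word:",
}


def build_prompt(modality, content):
    """Fill the modality-specific template with the raw input."""
    return TEMPLATES[modality].format(content=content)


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank(query_vec, candidates):
    """Return candidate ids sorted by descending similarity to the query."""
    return sorted(
        candidates,
        key=lambda name: cosine(query_vec, candidates[name]),
        reverse=True,
    )


# Toy embeddings standing in for model outputs (assumed values).
svg_embeddings = {
    "circle.svg": [0.9, 0.1, 0.0],
    "arrow.svg": [0.1, 0.9, 0.2],
    "star.svg": [0.0, 0.2, 0.9],
}
query_embedding = [0.8, 0.2, 0.1]  # e.g. the embedding of a text query

ranking = rank(query_embedding, svg_embeddings)
print(ranking)
```

Because the text query and the SVG candidates live in the same space, text-to-SVG retrieval reduces to a nearest-neighbor ranking; changing the wording of a template changes the embeddings without any retraining.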
Abstract
Scalable Vector Graphics (SVGs) function both as visual images and as structured code that encodes rich geometric and layout information, yet most methods rasterize them and discard this symbolic organization. At the same time, recent sentence embedding methods produce strong text representations but do not naturally extend to visual or structured modalities. We propose a training-free, instruction-guided multimodal embedding framework that uses a Multimodal Large Language Model (MLLM) to map text, raster images, and SVG code into an aligned embedding space. We control the direction of embeddings through modality-specific instructions and structural SVG cues, eliminating the need for learned projection heads or contrastive training. Our method has two key components: (1) Multimodal Explicit One-word Limitation (mEOL), which instructs the MLLM to summarize any multimodal input into a…
