mEOL: Training-Free Instruction-Guided Multimodal Embedder for Vector Graphics and Image Retrieval

Kyeong Seon Kim; Baek Seong-Eun; Lee Jung-Mok; Tae-Hyun Oh

arXiv:2604.17054·cs.CV·April 21, 2026

TL;DR

The paper introduces mEOL, a training-free, instruction-guided multimodal embedding framework that maps text, raster images, and SVG code into an aligned embedding space for vector graphics and image retrieval.

Contribution

It presents a training-free method that uses an MLLM with modality-specific instructions and structural SVG cues for multimodal embedding, and it reports outperforming encoder-based baselines in retrieval tasks.

Findings

01 mEOL outperforms encoder-based baselines in SVG and image retrieval.

02 The method enables prompt-level control over embeddings without training.

03 A new text-to-SVG retrieval benchmark is used to demonstrate the method's effectiveness.

Abstract

Scalable Vector Graphics (SVGs) function both as visual images and as structured code that encodes rich geometric and layout information, yet most methods rasterize them and discard this symbolic organization. At the same time, recent sentence embedding methods produce strong text representations but do not naturally extend to visual or structured modalities. We propose a training-free, instruction-guided multimodal embedding framework that uses a Multimodal Large Language Model (MLLM) to map text, raster images, and SVG code into an aligned embedding space. We control the direction of embeddings through modality-specific instructions and structural SVG cues, eliminating the need for learned projection heads or contrastive training. Our method has two key components: (1) Multimodal Explicit One-word Limitation (mEOL), which instructs the MLLM to summarize any multimodal input into a…
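As a rough illustration of the instruction-guided idea in the abstract, the sketch below builds one prompt per modality and ranks candidates by cosine similarity in a shared space. Everything in it is an assumption for illustration: the template wording, the `<image>` placeholder, and the names `TEMPLATES`, `build_prompt`, `cosine`, and `rank` are hypothetical and not taken from the paper. It also assumes that the embedding vectors come from an MLLM run on these prompts (for example, a hidden state at the end of the prompt), which is not implemented here.

```python
import math

# Hypothetical instruction templates, one per modality.
# The paper's actual prompt wording is not given in this listing.
TEMPLATES = {
    "text": 'This sentence: "{content}" means in one word:',
    "image": "<image> This image means in one word:",
    "svg": 'This SVG code: "{content}" means in one word:',
}


def build_prompt(modality, content=""):
    """Return the modality-specific instruction for one input."""
    return TEMPLATES[modality].format(content=content)


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank(query_vec, candidates):
    """Sort candidate ids by descending cosine similarity to the query.

    `candidates` maps an id to its embedding vector. The vectors are assumed
    to have been produced elsewhere by an MLLM; this sketch does not call one.
    """
    return sorted(
        candidates,
        key=lambda key: cosine(query_vec, candidates[key]),
        reverse=True,
    )
```

In a text-to-SVG retrieval setting, the query vector would come from the text prompt and the candidate vectors from the SVG prompts. Because no projection head is trained, changing the template wording is the only lever in this sketch for steering the embeddings.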

Code & Models

Repositories

https://scene-the-ella.github.io/meol
