Multimodal Generative Engine Optimization: Rank Manipulation for Vision-Language Model Rankers

Yixuan Du; Chenxiao Yu; Haoyan Xu; Ziyi Wang; Yue Zhao; Xiyang Hu

arXiv:2601.12263 · cs.CL · January 21, 2026

Open Access

TL;DR

This paper exposes a vulnerability in vision-language models (VLMs) used for search ranking: an adversary can manipulate rankings by jointly perturbing a product's image and text, which raises a significant security concern.

Contribution

The paper introduces MGEO, a novel adversarial framework that jointly optimizes imperceptible image perturbations and fluent textual suffixes to promote a target product in VLM-based ranking systems, revealing a new security threat.

Findings

01 Multimodal attacks outperform unimodal baselines.

02 Adversarial perturbations can be imperceptible yet effective.

03 VLMs' cross-modal coupling can be exploited for manipulation.

Abstract

Vision-Language Models (VLMs) are rapidly replacing unimodal encoders in modern retrieval and recommendation systems. While their capabilities are well-documented, their robustness against adversarial manipulation in competitive ranking scenarios remains largely unexplored. In this paper, we uncover a critical vulnerability in VLM-based product search: multimodal ranking attacks. We present Multimodal Generative Engine Optimization (MGEO), a novel adversarial framework that enables a malicious actor to unfairly promote a target product by jointly optimizing imperceptible image perturbations and fluent textual suffixes. Unlike existing attacks that treat modalities in isolation, MGEO employs an alternating gradient-based optimization strategy to exploit the deep cross-modal coupling within the VLM. Extensive experiments on real-world datasets using state-of-the-art models demonstrate…
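
The alternating strategy mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the toy `score` function, the weight tables `w`, `u`, `a`, `c`, and the helpers `image_step`, `text_step`, `alternating_attack`, and `rank_of` are all hypothetical stand-ins for a real VLM ranker. The sketch assumes a signed-gradient image update projected onto an L-infinity budget `eps`, and a per-position token swap chosen by the gradient with respect to the one-hot token choice. Suffix fluency is not modeled here.

```python
import random

def dot(p, q):
    return sum(pi * qi for pi, qi in zip(p, q))

def score(x, tokens, w, u, a, c):
    """Toy ranker score: image term + text term + cross-modal term (assumed form)."""
    text = sum(a[j][t] for j, t in enumerate(tokens))
    gate = sum(c[j][t] for j, t in enumerate(tokens))
    return dot(w, x) + text + dot(u, x) * gate

def image_step(x, x0, tokens, w, u, c, eps, lr):
    """One signed-gradient ascent step on the image, projected to the eps-ball and [0, 1]."""
    gate = sum(c[j][t] for j, t in enumerate(tokens))
    out = []
    for xi, x0i, wi, ui in zip(x, x0, w, u):
        g = wi + ui * gate                       # d(score)/d(x_i) for the toy scorer
        xi += lr * ((g > 0) - (g < 0))           # move along the gradient sign
        xi = min(max(xi, x0i - eps), x0i + eps)  # stay within the perturbation budget
        out.append(min(max(xi, 0.0), 1.0))       # stay a valid pixel value
    return out

def text_step(x, tokens, u, a, c):
    """Pick, per suffix position, the token with the largest one-hot gradient."""
    img = dot(u, x)
    return [
        max(range(len(a[j])), key=lambda k: a[j][k] + img * c[j][k])
        for j in range(len(tokens))
    ]

def alternating_attack(x0, tokens, w, u, a, c, eps=0.03, lr=0.01, rounds=10):
    """Alternate image and text updates so each modality sees the other's latest state."""
    x, tokens = list(x0), list(tokens)
    for _ in range(rounds):
        x = image_step(x, x0, tokens, w, u, c, eps, lr)
        tokens = text_step(x, tokens, u, a, c)
    return x, tokens

def rank_of(target_score, competitor_scores):
    """1-based rank of the target among competitors (higher score ranks first)."""
    return 1 + sum(s > target_score for s in competitor_scores)

if __name__ == "__main__":
    rng = random.Random(0)
    dim, suffix_len, vocab = 8, 3, 5
    x0 = [rng.random() for _ in range(dim)]
    w = [rng.uniform(-1, 1) for _ in range(dim)]
    u = [rng.uniform(-1, 1) for _ in range(dim)]
    a = [[rng.uniform(-1, 1) for _ in range(vocab)] for _ in range(suffix_len)]
    c = [[rng.uniform(-1, 1) for _ in range(vocab)] for _ in range(suffix_len)]
    start = [0] * suffix_len
    x, tokens = alternating_attack(x0, start, w, u, a, c)
    print("score before:", score(x0, start, w, u, a, c))
    print("score after: ", score(x, tokens, w, u, a, c))
```

Because the toy score is linear in the image for a fixed suffix and separable across suffix positions for a fixed image, each half-step can only keep or raise the score. A real VLM offers no such guarantee, and the token search would need a fluency constraint.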

Taxonomy

Topics: Multimodal Machine Learning Applications · Adversarial Robustness in Machine Learning · Advanced Graph Neural Networks