AdvDreamer Unveils: Are Vision-Language Models Truly Ready for Real-World 3D Variations?

Shouwei Ruan; Hanqing Liu; Yao Huang; Xiaoqi Wang; Caixin Kang; Hang Su; Yinpeng Dong; Xingxing Wei

arXiv:2412.03002·cs.CV·March 11, 2025

TL;DR

This paper introduces AdvDreamer, a framework for generating adversarial 3D transformations to evaluate the robustness of vision-language models in real-world scenarios, revealing significant vulnerabilities.

Contribution

AdvDreamer is the first system to produce physically reproducible adversarial 3D samples from single views, enabling systematic robustness evaluation of VLMs against real-world 3D variations.

Findings

01. VLMs are vulnerable to real-world 3D variations.

02. AdvDreamer generates high-quality adversarial samples.

03. Robustness gaps are identified across different VLM architectures.

Abstract

Vision-Language Models (VLMs) have exhibited remarkable generalization capabilities, yet their robustness in dynamic real-world scenarios remains largely unexplored. To systematically evaluate VLMs' robustness to real-world 3D variations, we propose AdvDreamer, the first framework capable of generating physically reproducible Adversarial 3D Transformation (Adv-3DT) samples from single-view observations. AdvDreamer integrates three key innovations. First, to precisely characterize real-world 3D variations with limited prior knowledge, we design a zero-shot Monocular Pose Manipulation pipeline built upon generative 3D priors. Second, to ensure the visual quality of worst-case Adv-3DT samples, we propose a Naturalness Reward Model that provides continuous naturalness regularization during adversarial optimization, effectively preventing convergence to hallucinated or unnatural…
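The idea of regularizing an adversarial search with a naturalness score can be illustrated with a toy sketch. This is not the authors' implementation: the loss and reward functions below are hypothetical stand-ins for a VLM task loss and a learned naturalness reward, the three-number pose is an assumed parameterization, and plain random search is swapped in for whatever optimizer the paper uses.

```python
import random

def toy_task_loss(pose):
    # Hypothetical stand-in for a VLM's task loss on the transformed sample:
    # it grows as the pose moves away from the canonical view.
    return sum(abs(p) for p in pose)

def toy_naturalness(pose):
    # Hypothetical stand-in for a naturalness reward in (0, 1]:
    # extreme poses are scored as less natural.
    return 1.0 / (1.0 + sum(p * p for p in pose) / 4.0)

def regularized_objective(pose, weight):
    # Adversarial term plus a weighted naturalness term (assumed form).
    return toy_task_loss(pose) + weight * toy_naturalness(pose)

def random_search(weight, steps=500, bound=3.0, seed=0):
    # Sample candidate poses uniformly and keep the highest-scoring one.
    rng = random.Random(seed)
    best_pose = (0.0, 0.0, 0.0)
    best_score = regularized_objective(best_pose, weight)
    for _ in range(steps):
        candidate = tuple(rng.uniform(-bound, bound) for _ in range(3))
        score = regularized_objective(candidate, weight)
        if score > best_score:
            best_pose, best_score = candidate, score
    return best_pose, best_score

if __name__ == "__main__":
    pose, score = random_search(weight=2.0)
    print("worst-case pose:", pose)
    print("objective:", score)
```

Raising `weight` shifts the search toward poses the toy reward scores as natural, at the cost of a smaller adversarial loss; that trade-off is the only point the sketch is meant to convey.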

Taxonomy

Topics: Multimodal Machine Learning Applications · Semantic Web and Ontologies