The Eye of Sherlock Holmes: Uncovering User Private Attribute Profiling via Vision-Language Model Agentic Framework

Feiran Liu; Yuzhe Zhang; Xinyi Huang; Yinan Peng; Xinfeng Li; Lixu Wang; Yutong Shen; Ranjie Duan; Simeng Qin; Xiaojun Jia; Qingsong Wen; Wei Dong

arXiv:2505.19139 · cs.CV · May 27, 2025

Open Access

TL;DR

This paper uncovers privacy risks in vision-language models that can infer sensitive and abstract user attributes from personal images, introducing a new dataset and a hybrid inference framework that outperforms existing methods and humans.

Contribution

It introduces PAPI, the largest dataset for private attribute profiling in images, and HolmesEye, a novel hybrid framework combining VLMs and LLMs for improved privacy inference.

Findings

01 HolmesEye improves accuracy by 10.8% over baselines.

02 HolmesEye surpasses human performance by 15.0% in predicting abstract attributes.

03 The study highlights significant privacy risks in image-based profiling.

Abstract

Our research reveals a new privacy risk associated with the vision-language model (VLM) agentic framework: the ability to infer sensitive attributes (e.g., age and health information) and even abstract ones (e.g., personality and social traits) from a set of personal images, which we term "image private attribute profiling." This threat is particularly severe given that modern apps can easily access users' photo albums, and inference from image sets enables models to exploit inter-image relations for more sophisticated profiling. However, two main challenges hinder our understanding of how well VLMs can profile an individual from a few personal photos: (1) the lack of benchmark datasets with multi-image annotations for private attributes, and (2) the limited ability of current multimodal large language models (MLLMs) to infer abstract attributes from large image collections. In this…

Taxonomy

Topics: Blockchain Technology Applications and Security · Sentiment Analysis and Opinion Mining · Human Mobility and Location-Based Analysis

Methods: Sparse Evolutionary Training