TL;DR
This study evaluates the privacy awareness of vision-language models in realistic physical environments using a new interactive framework, revealing significant perceptual and contextual limitations in current models.
Contribution
Introduces ImmersedPrivacy, an interactive simulation framework for assessing physical-world privacy awareness of VLMs, highlighting their perceptual and contextual shortcomings.
Findings
Every model's performance degrades as scene complexity increases.
No model exceeds 65% accuracy when the social context shifts.
The best model balances task completion and privacy in only 51% of conflicting commands.
Abstract
As Vision-Language Models (VLMs) are increasingly deployed as autonomous cognitive cores for embodied assistants, evaluating their privacy awareness in physical environments becomes critical. Unlike digital chatbots, these agents operate in intimate spaces, such as homes and hospitals, where they possess the physical agency to observe and manipulate privacy-sensitive information and artifacts. However, current benchmarks remain limited to unimodal, text-based representations that cannot capture the demands of real-world settings. To bridge this gap, we present ImmersedPrivacy, an interactive audio-visual evaluation framework that simulates realistic physical environments using a Unity-based simulator. ImmersedPrivacy evaluates physically grounded privacy awareness across three progressive tiers that test a model's ability to identify sensitive items in cluttered scenes, adapt to…
