Differentially Private Multimodal In-Context Learning
Ivoline C. Ngong, Zarreen Reza, Joseph P. Near

TL;DR
This paper introduces DP-MTV, a novel framework for many-shot multimodal in-context learning that maintains formal differential privacy, enabling sensitive data applications with minimal performance loss.
Contribution
DP-MTV is the first method to achieve differentially private multimodal in-context learning with hundreds of demonstrations, using a compact task vector aggregation approach.
Findings
Achieves 50% accuracy on VizWiz at ε=1.0, close to the 55% non-private baseline.
Supports multiple VLM architectures and auxiliary data.
Requires only a single noise addition, after which the released task vector can serve unlimited inference queries.
Abstract
Vision-language models are increasingly applied to sensitive domains such as medical imaging and personal photographs, yet existing differentially private methods for in-context learning are limited to few-shot, text-only settings because privacy cost scales with the number of tokens processed. We present Differentially Private Multimodal Task Vectors (DP-MTV), the first framework enabling many-shot multimodal in-context learning with formal (ε, δ)-differential privacy by aggregating hundreds of demonstrations into compact task vectors in activation space. DP-MTV partitions private data into disjoint chunks, applies per-layer clipping to bound sensitivity, and adds calibrated noise to the aggregate, requiring only a single noise addition that enables unlimited inference queries. We evaluate on eight benchmarks across three VLM architectures, supporting deployment with…
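The chunk, clip, and noise pipeline described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`clip_l2`, `gaussian_sigma`, `private_task_vector`), the replace-one sensitivity bound of 2·C·√(layers), and the classical Gaussian-mechanism calibration are all assumptions made for illustration, and the per-chunk activation vectors are random stand-ins for what a VLM would produce.

```python
import math
import random


def clip_l2(vec, clip_norm):
    """Rescale vec so that its L2 norm is at most clip_norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    return [x * scale for x in vec]


def gaussian_sigma(sensitivity, epsilon, delta):
    """Classical Gaussian-mechanism noise scale (assumption; stated for 0 < epsilon < 1)."""
    return sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / epsilon


def private_task_vector(chunk_vectors, clip_norm, epsilon, delta, rng):
    """Aggregate per-chunk, per-layer vectors into one noisy mean task vector.

    chunk_vectors[c][l] is the (hypothetical) activation vector of chunk c at layer l.
    Chunks are assumed disjoint, so one private example influences one chunk only.
    """
    num_chunks = len(chunk_vectors)
    num_layers = len(chunk_vectors[0])
    # Assumed bound: replacing one example moves one chunk's clipped vector by
    # at most 2 * clip_norm per layer, hence 2 * clip_norm * sqrt(num_layers) overall.
    sensitivity = 2.0 * clip_norm * math.sqrt(num_layers)
    sigma = gaussian_sigma(sensitivity, epsilon, delta)

    result = []
    for layer in range(num_layers):
        clipped = [clip_l2(chunk[layer], clip_norm) for chunk in chunk_vectors]
        dim = len(clipped[0])
        # Noise is added once to the clipped sum, then the sum is averaged.
        noisy_mean = [
            (sum(v[i] for v in clipped) + rng.gauss(0.0, sigma)) / num_chunks
            for i in range(dim)
        ]
        result.append(noisy_mean)
    return result


rng = random.Random(0)
# 100 chunks, 4 layers, 8 dimensions of stand-in activations.
chunk_vectors = [
    [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(4)]
    for _ in range(100)
]
task_vector = private_task_vector(
    chunk_vectors, clip_norm=1.0, epsilon=0.9, delta=1e-5, rng=rng
)
```

Because the noise is drawn once when the vector is released, reusing `task_vector` for any number of later queries is post-processing and consumes no additional privacy budget, which is the property the abstract refers to as unlimited inference queries.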
Taxonomy
Topics: Privacy-Preserving Technologies in Data · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications
