Self-Comparison for Dataset-Level Membership Inference in Large (Vision-)Language Models

Jie Ren; Kangrui Chen; Chen Chen; Vikash Sehwag; Yue Xing; Jiliang Tang; Lingjuan Lyu

arXiv:2410.13088·cs.LG·October 18, 2024

Open Access

TL;DR

This paper introduces a dataset-level membership inference method based on self-comparison and paraphrasing. It detects whether a dataset was used to train a large language model or vision-language model without needing ground-truth non-member data.

Contribution

The paper proposes a self-comparison-based dataset inference technique that does not require ground-truth non-member data, which makes it more practical and more effective than existing methods.

Findings

01 Outperforms traditional membership inference attacks and dataset inference techniques.

02 Effective across various datasets and models, including commercial APIs.

03 Does not require access to ground-truth non-member data.

Abstract

Large Language Models (LLMs) and Vision-Language Models (VLMs) have made significant advancements in a wide range of natural language processing and vision-language tasks. Access to large web-scale datasets has been a key factor in their success. However, concerns have been raised about the unauthorized use of copyrighted materials and potential copyright infringement. Existing methods, such as sample-level Membership Inference Attacks (MIA) and distribution-based dataset inference, distinguish member data (data used for training) and non-member data by leveraging the common observation that models tend to memorize and show greater confidence in member data. Nevertheless, these methods face challenges when applied to LLMs and VLMs, such as the requirement for ground-truth member data or non-member data that shares the same distribution as the test data. In this paper, we propose a novel…
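The abstract above is truncated, so the authors' exact procedure is not shown on this page. As an illustration only of how a self-comparison signal could be tested, the sketch below assumes you already have a per-sample confidence score (for example, average token log-likelihood) for each original text and for a paraphrase of it. The function name, the scores, and the choice of a sign-flip permutation test are all assumptions made here for illustration, not the paper's method.

```python
import random
from statistics import mean


def paired_permutation_pvalue(orig_scores, para_scores, n_perm=10000, seed=0):
    """One-sided sign-flip permutation test on paired scores.

    Hypothetical helper (not from the paper). Tests whether the model is
    systematically more confident on the original texts than on their
    paraphrases, i.e. whether mean(orig - para) > 0.
    """
    if len(orig_scores) != len(para_scores):
        raise ValueError("score lists must have equal length")
    diffs = [o - p for o, p in zip(orig_scores, para_scores)]
    observed = mean(diffs)
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_perm):
        # Under the null hypothesis the sign of each paired difference is arbitrary.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if mean(flipped) >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (hits + 1) / (n_perm + 1)


# Made-up scores: higher (less negative) means the model is more confident.
orig = [-1.0, -1.2, -0.9, -1.1, -1.0, -0.8, -1.3, -1.0]
para = [-1.8, -1.9, -1.5, -2.0, -1.7, -1.6, -2.1, -1.9]
p_value = paired_permutation_pvalue(orig, para)
print(f"p-value: {p_value:.4f}")
```

A small p-value here would only say that the model prefers the original wording over the paraphrases in this sample. Whether that preference indicates training membership depends on controls that this sketch does not include.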

Taxonomy

Topics: Natural Language Processing Techniques