DIS-CO: Discovering Copyrighted Content in VLMs Training Data

André V. Duarte, Xuandong Zhao, Arlindo L. Oliveira, Lei Li

arXiv:2502.17358 · cs.CV · June 3, 2025

TL;DR

DIS-CO is a novel method that detects copyrighted content in vision-language models by querying them with specific frames and analyzing their responses, revealing widespread exposure to copyrighted material.

Contribution

The paper introduces DIS-CO, a new approach for identifying copyrighted content in VLMs, along with MovieTection, a benchmark dataset for evaluation.

Findings

01

DIS-CO nearly doubles the average AUC of the best prior method on models with logits available.

02

All tested models show some exposure to copyrighted content.

03

DIS-CO infers whether content was part of a model's training data without direct access to that data.

Abstract

How can we verify whether copyrighted content was used to train a large vision-language model (VLM) without direct access to its training data? Motivated by the hypothesis that a VLM is able to recognize images from its training corpus, we propose DIS-CO, a novel approach to infer the inclusion of copyrighted content during the model's development. By repeatedly querying a VLM with specific frames from targeted copyrighted material, DIS-CO extracts the content's identity through free-form text completions. To assess its effectiveness, we introduce MovieTection, a benchmark comprising 14,000 frames paired with detailed captions, drawn from films released both before and after a model's training cutoff. Our results show that DIS-CO significantly improves detection performance, nearly doubling the average AUC of the best prior method on models with logits available. Our findings also…
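The querying and scoring loop described in the abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: `query_model` is a hypothetical callable standing in for a VLM call that returns a free-form guess for one frame, the word-level title match is an assumed scoring rule, and the AUC here simply compares per-film hit rates for films released before a model's cutoff (possible training members) against films released after it.

```python
import re
from typing import Callable, Sequence


def normalize(text: str) -> str:
    """Lowercase and drop punctuation so 'The Matrix.' matches 'the matrix'."""
    return " ".join(re.sub(r"[^a-z0-9 ]", " ", text.lower()).split())


def title_hit_rate(
    frames: Sequence[str],
    title: str,
    query_model: Callable[[str], str],  # hypothetical: frame id -> free-form answer
) -> float:
    """Fraction of frames for which the model's answer names the target title."""
    target = f" {normalize(title)} "
    hits = sum(target in f" {normalize(query_model(frame))} " for frame in frames)
    return hits / len(frames)


def auc(pos_scores: Sequence[float], neg_scores: Sequence[float]) -> float:
    """Chance that a random positive outscores a random negative (ties count 0.5)."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))


# Usage with a stubbed model (assumed answers, for illustration only).
fake_answers = {
    "f1": "The Matrix (1999)",
    "f2": "I think this is Inception.",
    "f3": "the matrix",
}
rate = title_hit_rate(["f1", "f2", "f3"], "The Matrix", fake_answers.get)
score = auc([0.9, 0.6], [0.6, 0.1])
```

In practice a real scoring rule would likely need to handle alternate titles and partial matches; the exact-phrase check above is only the simplest option.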

Code & Models

Repositories

avduarte333/dis-co (PyTorch, official)

Videos

DIS-CO: Discovering Copyrighted Content in VLMs Training Data · SlidesLive

Taxonomy

Topics: Biomedical Text Mining and Ontologies · Topic Modeling · Imbalanced Data Classification Techniques