Reconstructing training data from document understanding models

Jérémie Dentan; Arnaud Paran; Aymen Shabou

arXiv:2406.03182 · cs.CR · June 6, 2024

TL;DR

This paper introduces CDMI, a reconstruction attack that extracts sensitive fields from the training data of document understanding models such as LayoutLM and BROS.

Contribution

The paper presents the first reconstruction attack for document understanding models, demonstrating privacy risks and proposing evaluation metrics and defenses.

Findings

01

An adversary can perfectly reconstruct up to 4.1% of the fields of the documents used for fine-tuning, including some names, dates, and invoice amounts.

02

Combining the reconstruction attack with a membership inference attack raises attack accuracy to 22.5%.

03

The analysis covers how overfitting and model type (LayoutLM or BROS) affect attack success.

Abstract

Document understanding models are increasingly employed by companies to supplant humans in processing sensitive documents, such as invoices, tax notices, or even ID cards. However, the robustness of such models to privacy attacks remains vastly unexplored. This paper presents CDMI, the first reconstruction attack designed to extract sensitive fields from the training data of these models. We attack LayoutLM and BROS architectures, demonstrating that an adversary can perfectly reconstruct up to 4.1% of the fields of the documents used for fine-tuning, including some names, dates, and invoice amounts up to six-digit numbers. When our reconstruction attack is combined with a membership inference attack, our attack accuracy escalates to 22.5%. In addition, we introduce two new end-to-end metrics and evaluate our approach under various conditions: unimodal or bimodal data, LayoutLM or BROS…


Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Semantic Web and Ontologies