Multimodal Models Meet Presentation Attack Detection on ID Documents

Marina Villanueva; Juan M. Espin; Juan E. Tapia

arXiv:2603.29422·cs.CV·April 1, 2026

TL;DR

This paper investigates whether pre-trained multimodal models that combine visual and textual data can improve presentation attack detection (PAD) on ID documents, and finds that current models still struggle to detect attacks accurately.

Contribution

The paper applies pre-trained multimodal models such as PaliGemma, LLaVA, and Qwen to PAD on ID documents, highlighting both their potential and their current limitations.

Findings

01

Multimodal models merge visual embeddings with contextual metadata.

02

Experimental results show that the models struggle to detect presentation attacks accurately.

03

Current models need further development for reliable biometric security.

Abstract

The integration of multimodal models into Presentation Attack Detection (PAD) for ID Documents represents a significant advancement in biometric security. Traditional PAD systems rely solely on visual features, which often fail to detect sophisticated spoofing attacks. This study explores the combination of visual and textual modalities by utilizing pre-trained multimodal models, such as PaliGemma, LLaVA, and Qwen, to enhance the detection of presentation attacks on ID Documents. This approach merges deep visual embeddings with contextual metadata (e.g., document type, issuer, and date). However, experimental results indicate that these models struggle to accurately detect presentation attacks on ID Documents.
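
The abstract does not say how the contextual metadata reaches the model. One plausible route, sketched below purely as an illustration, is to serialize the metadata into the text prompt that accompanies the document image and then map the model's free-text reply to a PAD label. Everything here is an assumption rather than the authors' method: the names `DocumentMetadata`, `build_pad_prompt`, and `parse_pad_answer`, the prompt wording, and the label set are all hypothetical, and no model is actually called.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DocumentMetadata:
    """Contextual fields listed in the abstract; field names are hypothetical."""
    document_type: str
    issuer: str
    issue_date: str  # free-form date string, e.g. "2021-05-17"


def build_pad_prompt(meta: DocumentMetadata) -> str:
    """Compose the text half of an (image, text) query for a multimodal model."""
    return (
        "You are checking an ID document image for presentation attacks.\n"
        f"Document type: {meta.document_type}\n"
        f"Issuer: {meta.issuer}\n"
        f"Issue date: {meta.issue_date}\n"
        "Answer with exactly one word: BONAFIDE or ATTACK."
    )


def parse_pad_answer(reply: str) -> str:
    """Map a free-text reply to 'attack', 'bonafide', or 'unknown'.

    Simple keyword matching: it assumes the model follows the one-word
    instruction, so a reply such as "no attack" would be misread as 'attack'.
    Replies that mention both labels or neither are treated as 'unknown'.
    """
    text = reply.strip().lower()
    has_attack = "attack" in text
    has_bonafide = "bonafide" in text or "bona fide" in text
    if has_attack and not has_bonafide:
        return "attack"
    if has_bonafide and not has_attack:
        return "bonafide"
    return "unknown"


if __name__ == "__main__":
    meta = DocumentMetadata("national ID card", "Example Issuer", "2021-05-17")
    print(build_pad_prompt(meta))
    print(parse_pad_answer("ATTACK"))
```

In a real pipeline the prompt would be passed together with the image to whichever multimodal model is under evaluation. Keeping an explicit 'unknown' outcome makes it possible to count replies that do not follow the instruction separately from wrong decisions.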
