DAPFAM: A Domain-Aware Family-level Dataset to benchmark cross domain patent retrieval

Iliass Ayaou (ICube); Denis Cavallucci (ICube); Hicham Chibane (ICube)

arXiv:2506.22141 · cs.CL · March 3, 2026

TL;DR

DAPFAM is a benchmark dataset for evaluating cross-domain patent retrieval. It exposes a significant performance gap between IN-domain and OUT-domain retrieval and provides a test bed for developing more robust IR systems.

Contribution

The paper introduces DAPFAM, a family-level patent retrieval benchmark with explicit IN-domain and OUT-domain partitions, and reports 249 controlled experiments covering lexical, dense, and hybrid retrieval methods.

Findings

01. OUT-domain performance is roughly five times lower than IN-domain performance.
02. Passage-level retrieval outperforms document-level retrieval.
03. Dense retrieval methods offer modest gains over BM25.

Abstract

Patent prior-art retrieval becomes especially challenging when relevant disclosures cross technological boundaries. Existing benchmarks lack explicit domain partitions, making it difficult to assess how retrieval systems cope with such shifts. We introduce DAPFAM, a family-level benchmark with explicit IN-domain and OUT-domain partitions defined by a new IPC3 overlap scheme. The dataset contains 1,247 query families and 45,336 target families, aggregated at the family level to reduce international redundancy, with citation-based relevance judgments. We conduct 249 controlled experiments spanning lexical (BM25) and dense (transformer) backends, document- and passage-level retrieval, multiple query and document representations, aggregation strategies, and hybrid fusion via Reciprocal Rank Fusion (RRF). Results reveal a pronounced domain gap: OUT-domain performance remains roughly five times…
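
The hybrid fusion step named in the abstract, Reciprocal Rank Fusion, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name, the family identifiers, and the smoothing constant k = 60 (a commonly used default) are assumptions made for illustration only. Each ranked list contributes 1 / (k + rank) to the score of every item it contains, and items are then sorted by their summed score.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of IDs with Reciprocal Rank Fusion.

    rankings: list of lists, each ordered from best to worst.
    k: smoothing constant (60 is a common default, assumed here).
    Returns the fused list of IDs, best first.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based: the top item of a list contributes 1 / (k + 1).
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by ID for deterministic output.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical ranked lists of patent family IDs from two backends.
bm25_ranking = ["F1", "F2", "F3"]
dense_ranking = ["F3", "F1", "F4"]

fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
print(fused)
```

Families that appear near the top of both lists (here F1 and F3) rise above families retrieved by only one backend, which is the behaviour a hybrid lexical and dense setup relies on.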

Code & Models

Datasets

datalyes/DAPFAM_patent
dataset · 595 downloads

Taxonomy

Topics: Intellectual Property and Patents