DAPFAM: A Domain-Aware Family-level Dataset to benchmark cross domain patent retrieval
Iliass Ayaou (ICube), Denis Cavallucci (ICube), Hicham Chibane (ICube)

TL;DR
DAPFAM is a new benchmark dataset designed to evaluate cross-domain patent retrieval, highlighting significant domain gaps and providing a platform for developing more robust IR systems.
Contribution
It introduces DAPFAM, a family-level patent retrieval benchmark with explicit domain partitions, and provides comprehensive experimental analysis across various retrieval methods.
Findings
OUT-domain performance is roughly five times lower than IN-domain performance.
Passage-level retrieval outperforms document-level retrieval.
Dense retrieval methods offer modest gains over BM25.
Abstract
Patent prior-art retrieval becomes especially challenging when relevant disclosures cross technological boundaries. Existing benchmarks lack explicit domain partitions, making it difficult to assess how retrieval systems cope with such shifts. We introduce DAPFAM, a family-level benchmark with explicit IN-domain and OUT-domain partitions defined by a new IPC3 overlap scheme. The dataset contains 1,247 query families and 45,336 target families aggregated at the family level to reduce international redundancy, with citation-based relevance judgments. We conduct 249 controlled experiments spanning lexical (BM25) and dense (transformer) backends, document- and passage-level retrieval, multiple query and document representations, aggregation strategies, and hybrid fusion via Reciprocal Rank Fusion (RRF). Results reveal a pronounced domain gap: OUT-domain performance remains roughly five times…
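The hybrid fusion step named in the abstract, Reciprocal Rank Fusion, can be illustrated with a minimal sketch. This is the generic RRF formula, in which each item scores the sum of 1 / (k + rank) over the rankings that contain it. It is not the authors' implementation. The function name, the family identifiers, and the constant k = 60 (a commonly used default) are assumptions for illustration and are not taken from the paper.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of IDs with generic RRF.

    rankings: list of ranked lists, best item first.
    k: smoothing constant; 60 is a common default, assumed here.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based: the top item contributes 1 / (k + 1).
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending fused score; break ties by ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical rankings of patent families from a lexical and a dense backend.
bm25_ranking = ["fam_A", "fam_B", "fam_C"]
dense_ranking = ["fam_B", "fam_D", "fam_A"]

fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
print(fused)  # ['fam_B', 'fam_A', 'fam_D', 'fam_C']
```

Because RRF uses only rank positions, it can combine BM25 and dense scores without normalizing them onto a common scale.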
Taxonomy
Topics: Intellectual Property and Patents
