PatenTEB: A Comprehensive Benchmark and Model Family for Patent Text Embedding
Iliass Ayaou, Denis Cavallucci

TL;DR
PatenTEB introduces a comprehensive patent-specific benchmark and a versatile model family trained on multiple tasks, significantly advancing patent text embedding quality and generalization for various patent analysis applications.
Contribution
The paper presents PatenTEB, a new benchmark with 15 tasks and 2.06 million examples, and develops the patembed model family with multi-task training and domain-specific enhancements.
Findings
patembed-base outperforms the previous state of the art on MTEB BigPatentClustering.v2 (0.494 vs. 0.445 V-measure).
patembed-large achieves 0.377 NDCG@100 on DAPFAM.
Multi-task training enhances external generalization.
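For readers unfamiliar with the retrieval metric quoted for DAPFAM, here is a minimal sketch of NDCG@k in plain Python. It assumes linear gain (the relevance value itself) and a log2 rank discount. The exact gain definition and relevance grading used in the DAPFAM evaluation are not stated on this page, so treat the function names and conventions as illustrative only.

```python
import math
from typing import Optional, Sequence


def dcg_at_k(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the first k items, in ranked order."""
    # Rank 0 is discounted by log2(2) = 1, rank 1 by log2(3), and so on.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(
    ranked_relevances: Sequence[float],
    k: int,
    all_relevances: Optional[Sequence[float]] = None,
) -> float:
    """NDCG@k: DCG of the system ranking divided by DCG of the ideal ranking.

    all_relevances optionally lists the relevance of every judged document,
    including ones the system did not retrieve; by default the ideal ranking
    is built from the retrieved list alone.
    """
    pool = ranked_relevances if all_relevances is None else all_relevances
    ideal_dcg = dcg_at_k(sorted(pool, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0  # no relevant documents at all
    return dcg_at_k(ranked_relevances, k) / ideal_dcg


# Relevant documents at ranks 1 and 3 out of five retrieved.
score = ndcg_at_k([1, 0, 1, 0, 0], k=5)
```

With binary relevance, the linear-gain form used here coincides with the exponential-gain form (2^rel - 1) that some toolkits use.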
Abstract
Patent text embeddings enable prior art search, technology landscaping, and patent analysis, yet existing benchmarks inadequately capture patent-specific challenges. We introduce PatenTEB, a comprehensive benchmark comprising 15 tasks across retrieval, classification, paraphrase, and clustering, with 2.06 million examples. PatenTEB employs domain-stratified splits, domain-specific hard negative mining, and systematic coverage of asymmetric fragment-to-document matching scenarios absent from general embedding benchmarks. We develop the patembed model family through multi-task training, spanning 67M to 344M parameters with context lengths up to 4096 tokens. External validation shows strong generalization: patembed-base achieves state-of-the-art on MTEB BigPatentClustering.v2 (0.494 V-measure vs. 0.445 previous best), while patembed-large achieves 0.377 NDCG@100 on DAPFAM. Systematic…
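The clustering result above is reported as V-measure. As a rough illustration of what that number measures, the sketch below computes it from scratch as the harmonic mean of homogeneity and completeness, following the standard entropy-based definition. The function name and the `beta` parameter are choices made for this example; this is not the benchmark's own evaluation code.

```python
import math
from collections import Counter
from typing import Hashable, Iterable, Sequence


def _entropy(counts: Iterable[int], n: int) -> float:
    """Shannon entropy (natural log) of a partition given its group sizes."""
    return -sum((c / n) * math.log(c / n) for c in counts if c > 0)


def v_measure(
    labels_true: Sequence[Hashable],
    labels_pred: Sequence[Hashable],
    beta: float = 1.0,
) -> float:
    """V-measure: weighted harmonic mean of homogeneity and completeness."""
    n = len(labels_true)
    if n == 0 or n != len(labels_pred):
        raise ValueError("label sequences must be non-empty and of equal length")

    class_sizes = Counter(labels_true)
    cluster_sizes = Counter(labels_pred)
    joint = Counter(zip(labels_true, labels_pred))

    h_class = _entropy(class_sizes.values(), n)
    h_cluster = _entropy(cluster_sizes.values(), n)
    # Conditional entropies H(class | cluster) and H(cluster | class).
    h_class_given_cluster = -sum(
        (m / n) * math.log(m / cluster_sizes[k]) for (_, k), m in joint.items()
    )
    h_cluster_given_class = -sum(
        (m / n) * math.log(m / class_sizes[c]) for (c, _), m in joint.items()
    )

    # Homogeneity: each cluster contains members of a single class.
    homogeneity = 1.0 if h_class == 0 else 1.0 - h_class_given_cluster / h_class
    # Completeness: all members of a class land in the same cluster.
    completeness = 1.0 if h_cluster == 0 else 1.0 - h_cluster_given_class / h_cluster

    if homogeneity + completeness == 0:
        return 0.0
    return (1 + beta) * homogeneity * completeness / (beta * homogeneity + completeness)


# Two of three classes merged into one cluster: complete but not homogeneous.
example = v_measure([0, 0, 1, 2], [0, 0, 1, 1])
```

The score is invariant to how cluster IDs are named, so a clustering that matches the reference partition under any relabeling scores 1.0.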
Code & Models
- 🤗 datalyes/patembed-large (model, 534 downloads, 1 like)
- 🤗 datalyes/patembed-base (model, 62 downloads)
- 🤗 datalyes/patembed-base_small (model)
- 🤗 datalyes/patembed-small (model)
- 🤗 datalyes/patembed-mini (model)
- 🤗 datalyes/patembed-nano (model)
- 🤗 datalyes/patembed-base_long_1024 (model)
- 🤗 datalyes/patembed-base_long_2048 (model)
- 🤗 datalyes/patembed-base_long_4096 (model, 17 downloads)
- 🤗 datalyes/patembed-large_no_prompts (model, 4 downloads)
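A hedged sketch of how one of the checkpoints above might be used for a small prior-art-style ranking. It assumes the models load through the `sentence-transformers` library, which this page does not confirm; check the model cards before relying on it. The helper names (`embed`, `rank_by_similarity`) are hypothetical. The existence of a separate `no_prompts` variant suggests the other checkpoints may expect task prompts, whose format is not given here, so the sketch encodes raw text only.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(
    query_vec: Sequence[float], doc_vecs: Sequence[Sequence[float]]
) -> List[int]:
    """Return document indices ordered from most to least similar to the query."""
    scores = [cosine_similarity(query_vec, d) for d in doc_vecs]
    return sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)


def embed(texts: Sequence[str], model_name: str = "datalyes/patembed-base") -> List[List[float]]:
    """Embed texts with a patembed checkpoint.

    ASSUMPTION: the checkpoint is loadable with sentence-transformers.
    The import is deferred so the ranking helpers work without the dependency.
    """
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    return model.encode(list(texts), normalize_embeddings=True).tolist()


# Ranking with toy 2-D vectors; swap in embed([...]) output for real text.
order = rank_by_similarity([1.0, 0.0], [[0.0, 1.0], [1.0, 0.0], [0.7, 0.7]])
```

To try it on real text, pass a claim fragment and a list of candidate abstracts through `embed` and feed the resulting vectors to `rank_by_similarity`. This requires network access to download the checkpoint.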
