Citation-Driven Multi-View Training for Patent Embeddings: QaECTER and Sophia-Bench
Younes Djemmal, You Zuo (ALMAnaCH), Kim Gerdes (LISN, Qatent), Kirian Guiller

TL;DR
This paper introduces Sophia-Bench, a comprehensive patent retrieval benchmark, and QaECTER, a compact embedding model that achieves state-of-the-art results across diverse patent search scenarios.
Contribution
The paper presents Sophia-Bench, a large-scale patent retrieval benchmark of 10,000 queries and 75,000 corpus documents, and QaECTER, a compact embedding model that outperforms larger models across multiple patent search tasks.
Findings
QaECTER outperforms a 23x larger model on patent retrieval benchmarks.
Sophia-Bench evaluates retrieval across 12 query types and multiple jurisdictions.
QaECTER surpasses all prior models without task-specific prompts.
Abstract
Patent retrieval underpins critical decisions in innovation, examination, and IP strategy, yet progress has been hampered by the absence of benchmarks that reflect the diversity of real-world search scenarios. We address this gap with two contributions. First, we introduce Sophia-Bench, a large-scale patent retrieval benchmark comprising 10,000 queries and 75,000 corpus documents stratified across ten years, eight IPC technology sections, and twelve filing jurisdictions. Unlike prior benchmarks, Sophia-Bench tests retrieval using 12 different query types, from structured patent fields to AI-generated summaries, and evaluates results against citation-based ground truth enhanced with a novel domain-relevance metric (InScope). Together, these enable systematic measurement of how well models perform across query types, technology domains, and jurisdictions. Second, we introduce QaECTER, a…
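As a rough illustration of what scoring retrieval against citation-based ground truth can look like, the sketch below computes Recall@k over a toy set of queries. It is a generic metric under stated assumptions, not the paper's evaluation protocol and not the InScope metric, which the abstract does not define. All function names and document IDs here are hypothetical.

```python
from typing import Dict, List, Set


def recall_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant (cited) documents that appear in the top-k results."""
    if not relevant:
        return 0.0
    # Use a set so a duplicated ID in the ranking is not counted twice.
    hits = len(set(ranked[:k]) & relevant)
    return hits / len(relevant)


def mean_recall_at_k(
    rankings: Dict[str, List[str]],
    citations: Dict[str, Set[str]],
    k: int,
) -> float:
    """Average Recall@k over all queries that have a ranking."""
    scores = [
        recall_at_k(ranked, citations.get(query_id, set()), k)
        for query_id, ranked in rankings.items()
    ]
    return sum(scores) / len(scores) if scores else 0.0


# Toy data (hypothetical IDs): each query maps to a ranked list of corpus
# documents returned by some retrieval model.
rankings = {
    "q1": ["d3", "d1", "d7", "d2"],
    "q2": ["d5", "d9", "d4", "d8"],
}
# Assumed ground truth: the documents cited by each query patent.
citations = {
    "q1": {"d1", "d2"},
    "q2": {"d4"},
}

score = mean_recall_at_k(rankings, citations, k=3)
print(f"Recall@3 = {score:.2f}")
```

In a stratified benchmark, the same function could be applied separately to each subset of queries (for example, by query type or jurisdiction) to compare performance across strata.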
