Automating the search for a patent's prior art with a full text similarity search

Lea Helmers; Franziska Horn; Franziska Biegler; Tim Oppermann; Klaus-Robert Müller

arXiv:1901.03136·cs.IR·March 6, 2019

TL;DR

This paper presents an automated method using machine learning and NLP to improve and speed up the patent prior art search process by comparing full texts of patents.

Contribution

It introduces a novel full-text similarity search approach for patents, outperforming traditional keyword-based methods in both speed and quality.

Findings

1. Automated approach accelerates prior art search process.

2. Improves the relevance and quality of search results.

3. Evaluation shows better performance compared to existing methods.

Tables

Table 1. Evaluation results on the cited/random dataset. AUC values when computing the cosine similarity with BOW, LSA, KPCA, word2vec, and doc2vec features constructed from different patent sections of the cited/random dataset.

Features	patent section: AUC
	full text	abstract	claims
Bag-of-words	0.9560	0.8620	0.8656
LSA	0.9361	0.8579	0.8561
KPCA	0.9207	0.8377	0.8250
BOW + word2vec	0.9410	0.8618	0.8525
doc2vec	0.9314	0.8919	0.8898

Table 2. Confusion matrix for the dataset subsample. The original cited/random labelling is compared to the more accurate relevant/irrelevant labels.

	cited	random
relevant	65	18
irrelevant	86	281
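
As a reading aid for Table 2, the following sketch recomputes a few ratios from its four counts. The variable names are illustrative and are not taken from the paper's code; the ratios are simple arithmetic on the table and not additional results.

```python
# Counts from Table 2 (rows: attorney labels, columns: original labels).
relevant_cited, relevant_random = 65, 18
irrelevant_cited, irrelevant_random = 86, 281

total = relevant_cited + relevant_random + irrelevant_cited + irrelevant_random
cited = relevant_cited + irrelevant_cited
relevant = relevant_cited + relevant_random

# Share of cited pairs that the attorney also judged relevant.
share_cited_relevant = relevant_cited / cited
# Share of relevant pairs that came from the random (uncited) group.
share_relevant_uncited = relevant_random / relevant
# Share of pairs on which both labellings agree.
agreement = (relevant_cited + irrelevant_random) / total

print(total, cited, relevant)
print(round(share_cited_relevant, 3), round(share_relevant_uncited, 3), round(agreement, 3))
```

The four counts sum to 450, which matches the number of manually labelled patent pairs in the second dataset.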

Table 3. Correlations between labels and similarity scores on the dataset subsample. Spearman’s $\rho$ for the cosine similarity calculated with BOW feature vectors and the relevant/irrelevant and cited/random labelling.

	cited/random	relevant/irr.
cosine (BOW)	0.501	0.652
relevant/irr.	0.592	—

Table 4. Summary of evaluation results. AUC and average precision (AP) scores for the different feature extraction methods on the dataset subsample with cited/random and relevant/irrelevant labelling, as well as the full dataset.

Features	AUC			AP
	subsample		full	subsample		full
	relevant	cited	cited	relevant	cited	cited
Bag-of-words	0.8118	0.8063	0.9560	0.5274	0.7095	0.4705
LSA	0.7798	0.7075	0.9361	0.4787	0.5921	0.3257
KPCA	0.7441	0.6740	0.9207	0.4721	0.5832	0.2996
BOW + word2vec	0.8408	0.8544	0.9410	0.5443	0.7354	0.4019
doc2vec	0.7658	0.8138	0.9314	0.4749	0.6829	0.3121

Table 5. Overview of similarity measures for sequential data [55].

Similarity coefficients
Cosine	$\frac{\sum_{w \in L} Φ_{w} (x_{i}) Φ_{w} (x_{j})}{\sqrt{\sum_{w} Φ_{w} {(x_{i})}^{2}} \sqrt{\sum_{w} Φ_{w} {(x_{j})}^{2}}}$
Braun-Blanquet	$\frac{a}{\max (a + b, a + c)}$
Czekanowski, Sørensen-Dice	$\frac{2 a}{(2 a + b + c)}$
Jaccard	$\frac{a}{(a + b + c)}$
Kulczynski	$\frac{a}{2 (a + b)} + \frac{a}{2 (a + c)}$
Otsuka, Ochiai	$\frac{a}{\sqrt{(a + b) (a + c)}}$
Simpson	$\frac{a}{\min (a + b, a + c)}$
Sokal-Sneath, Anderberg	$\frac{a}{(a + 2 (b + c))}$
Kernel functions
Linear	$\sum_{w \in L} Φ_{w} (x_{i}) Φ_{w} (x_{j})$
Gaussian	$\exp (\frac{- d {(x_{i}, x_{j})}^{2}}{2 σ^{2}})$
Histogram intersection	$\sum_{w \in L} \min (Φ_{w} (x_{i}), Φ_{w} (x_{j}))$
Polynomial	${(\sum_{w \in L} Φ_{w} (x_{i}) Φ_{w} (x_{j}) + Θ)}^{p}$
Sigmoidal	$\tanh (\sum_{w \in L} Φ_{w} (x_{i}) Φ_{w} (x_{j}) + Θ)$
Distance functions
Canberra	$\sum_{w \in L} \frac{\| Φ_{w} (x_{i}) - Φ_{w} (x_{j}) \|}{Φ_{w} (x_{i}) + Φ_{w} (x_{j})}$
Chebyshev	$\max_{w \in L} \| Φ_{w} (x_{i}) - Φ_{w} (x_{j}) \|$
Euclidean	$\sum_{w \in L} {\| Φ_{w} (x_{i}) - Φ_{w} (x_{j}) \|}^{2}$
Geodesic	$\arccos \sum_{w \in L} Φ_{w} (x_{i}) Φ_{w} (x_{j})$
Hellinger²	$\sum_{w \in L} {(\sqrt{Φ_{w} (x_{i})} - \sqrt{Φ_{w} (x_{j})})}^{2}$
Jensen-Shannon	$\sum_{w \in L} H (Φ_{w} (x_{i}), Φ_{w} (x_{j}))$
Manhattan	$\sum_{w \in L} \| Φ_{w} (x_{i}) - Φ_{w} (x_{j}) \|$
Minkowski^p	$\sum_{w \in L} {\| Φ_{w} (x_{i}) - Φ_{w} (x_{j}) \|}^{p}$
$χ^{2}$	$\sum_{w \in L} \frac{{(Φ_{w} (x_{i}) - Φ_{w} (x_{j}))}^{2}}{Φ_{w} (x_{i}) + Φ_{w} (x_{j})}$
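
The similarity coefficients in Table 5 are written in terms of $a$, $b$, and $c$, which are not defined in this part of the text. The sketch below assumes the common set-based reading for binary (0/1) features: $a$ counts words present in both documents, $b$ words only in the first, and $c$ words only in the second. Under that assumption it illustrates that Jaccard, Sørensen-Dice, and Sokal-Sneath all increase with $a / (b + c)$ and therefore rank candidates identically. The function names and toy documents are hypothetical.

```python
def abc_counts(words_i, words_j):
    """Return (a, b, c) for two word sets (assumed binary definition)."""
    a = len(words_i & words_j)  # words in both documents
    b = len(words_i - words_j)  # words only in document i
    c = len(words_j - words_i)  # words only in document j
    return a, b, c


def jaccard(a, b, c):
    return a / (a + b + c)


def dice(a, b, c):
    return 2 * a / (2 * a + b + c)


def sokal_sneath(a, b, c):
    return a / (a + 2 * (b + c))


query = {"stent", "blood", "vessel", "implant"}
candidates = {
    "doc_a": {"stent", "vessel", "implant", "coating"},
    "doc_b": {"blood", "filter", "vessel"},
    "doc_c": {"dental", "brush"},
}


def ranking(coefficient):
    """Candidate names sorted from most to least similar to the query."""
    scores = {
        name: coefficient(*abc_counts(query, words))
        for name, words in candidates.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


rankings = [ranking(f) for f in (jaccard, dice, sokal_sneath)]
print(rankings)
```

Since a rank-based measure such as the AUC depends only on the ordering of the scores, coefficients that induce the same ordering yield the same AUC.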

Table 6. CPC table for the subcategories of class A61 (medical or veterinary science and hygiene).

A61B	Diagnosis
	Surgery
	Identification
A61C	Dentistry
A61C	Apparatus or methods for oral or dental hygiene
A61D	Veterinary instruments, implements, tools, or methods
A61F	Filters implantable into blood vessels
	Prostheses
	Devices providing patency to or preventing collapsing of tubular structures of the body, e.g. stents
	Orthopaedic, nursing or contraceptive devices
	Fomentation
	Treatment or protection of eyes or ears
	Bandages, dressings or absorbent pads
	First-aid kits
A61G	Transport or accommodation for patients
	Operating tables or chairs
	Chairs for dentistry
	Funeral devices
A61H	Physical therapy apparatus, e.g. devices for locating or stimulating reflex points in the body
	Artificial respiration
	Massage
	Bathing devices for special therapeutic or hygienic purposes or specific parts of the body
A61J	Containers specially adapted for medical or pharmaceutical purposes
	Devices or methods specially adapted for bringing pharmaceutical products into particular physical or administering forms
	Devices for administering food or medicines orally
	Baby comforters
	Devices for receiving spittle
A61K	Preparations for medical, dental, or toilet purposes
A61L	Methods or apparatus for sterilising materials or objects in general
	Disinfection, sterilisation, or deodorisation of air
	Chemical aspects of bandages, dressings, absorbent pads, or surgical articles
	Materials for bandages, dressings, absorbent pads, or surgical articles
A61M	Devices for introducing media into, or onto, the body
	Devices for transducing body media or for taking media from the body
	Devices for producing or ending sleep or stupor
A61N	Electrotherapy
	Magnetotherapy
	Radiation therapy
	Ultrasound therapy
A61Q	Specific use of cosmetics or similar toilet preparations

Table 7. AUC scores for all the tested combinations of BOW feature extraction approaches and similarity functions on the cited/random corpus. The best result for each similarity function is printed in bold and the best result for each function class is underlined. ∗ The linear kernel with length normalized vectors corresponds to the cosine similarity. + The AUC is equal because, for length normalized vectors (i.e. $\|\mathbf{x}_{i}\|_{2}=\|\mathbf{x}_{j}\|_{2}=1$), we get $\|\mathbf{x}_{i}-\mathbf{x}_{j}\|_{2}^{2}=(\mathbf{x}_{i}-\mathbf{x}_{j})^{T}(\mathbf{x}_{i}-\mathbf{x}_{j})=\mathbf{x}_{i}^{T}\mathbf{x}_{i}-2\mathbf{x}_{i}^{T}\mathbf{x}_{j}+\mathbf{x}_{j}^{T}\mathbf{x}_{j}=1-2\mathbf{x}_{i}^{T}\mathbf{x}_{j}+1=2-2\mathbf{x}_{i}^{T}\mathbf{x}_{j}$, and $\mathbf{x}_{i}^{T}\mathbf{x}_{j}$ is equal to the cosine similarity.

Similarity coefficients
	normalization	tf	0/1	tf-idf	0/1-idf
Braun-Blanquet	length	0.8550	0.7941	0.9480	0.8791
	max	0.8075	0.7941	0.9338	0.8753
Czekanowski, Sørensen-Dice	length	0.8749	0.8371	0.9555	0.9021
	max	0.8593	0.8680	0.9505	0.9144
Jaccard	length	0.8749	0.8371	0.9555	0.9021
	max	0.8593	0.8680	0.9505	0.9144
Kulczynski	length	0.8767	0.8536	0.9574	0.9122
	max	0.8761	0.9079	0.9571	0.9266
Otsuka, Ochiai	length	0.8759	0.8451	0.9568	0.9072
	max	0.8687	0.8982	0.9558	0.9323
Simpson	length	0.8566	0.8982	0.9543	0.9268
	max	0.8190	0.7879	0.9479	0.8685
Sokal-Sneath, Anderberg	length	0.8749	0.8371	0.9555	0.9021
	max	0.8593	0.8680	0.9505	0.9144
Kernel functions
	normalization	tf	0/1	tf-idf	0/1-idf
Linear	length^∗+	0.7336	0.8982	0.9560	0.9470
	max	0.5411	0.7142	0.9387	0.8168
Gaussian	length	0.7336	0.8982	0.9560	0.9470
	max	0.6909	0.5010	0.6366	0.5083
Histogram intersection	length	0.7853	0.8050	0.9239	0.8759
	max	0.6939	0.7142	0.8969	0.7694
Polynomial	length	0.7336	0.8982	0.9480	0.9468
	max	0.5411	0.7142	0.9383	0.8168
Sigmoidal	length	0.7336	0.8982	0.9560	0.9470
	max	0.5411	0.5000	0.9387	0.7971
Distance functions
	normalization	tf	0/1	tf-idf	0/1-idf
Canberra	length	0.5253	0.5479	0.6523	0.6184
	max	0.5259	0.5937	0.6072	0.5438
Chebyshev	length	0.6252	0.5686	0.6056	0.5473
	max	0.6271	0.5000	0.6162	0.5006
Hellinger	length	0.8746	0.6709	0.7213	0.6183
	max	0.8064	0.5937	0.6559	0.5788
Jensen-Shannon	length	0.8607	0.6699	0.7028	0.6173
	max	0.7889	0.5937	0.6415	0.5787
Manhattan	length	0.7987	0.6486	0.6437	0.6002
	max	0.7239	0.5937	0.5997	0.5767
Minkowski ( $p = 3$ )	length	0.6765	0.7203	0.6934	0.5930
	max	0.6606	0.5937	0.6616	0.5846
Euclidean	length⁺	0.7336	0.8982	0.9560	0.9470
	max	0.6909	0.5937	0.6366	0.5794
$χ^{2}$	length	0.8476	0.6686	0.7567	0.6134
	max	0.7739	0.5937	0.6180	0.5543
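
The identity in the Table 7 caption, $\|\mathbf{x}_{i}-\mathbf{x}_{j}\|_{2}^{2}=2-2\mathbf{x}_{i}^{T}\mathbf{x}_{j}$ for length normalized vectors, can be checked numerically. This is a minimal sketch with made-up vectors and hypothetical helper names; it only illustrates why the Euclidean distance and the cosine similarity order pairs the same way (in opposite directions) after length normalization.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def squared_euclidean(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


x_i = l2_normalize([1.0, 2.0, 0.0, 3.0])
x_j = l2_normalize([0.5, 0.0, 4.0, 1.0])

lhs = squared_euclidean(x_i, x_j)   # ||x_i - x_j||^2
rhs = 2.0 - 2.0 * dot(x_i, x_j)     # 2 - 2 * cosine similarity
print(math.isclose(lhs, rhs, rel_tol=1e-9))
```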

Equations

$sim (x_{i}, x_{j})$

$AP = \sum_{n} (R_{n} - R_{n - 1}) P_{n}$

$idf (w)$

$x_{k} (w)$

$\tilde{x}_{k} = \frac{x_{k}}{\max_{w} x_{k} (w)}$

$k (x_{i}, x_{j}) = ⟨ f (x_{i}), f (x_{j}) ⟩$

$a$, $b$, $c$

$TPR$, $FPR$, $FNR$

Code & Models

Repositories

helmersl/patent_similarity_search

Full text

Automating the search for a patent’s prior art with a full text similarity search

Lea Helmers¹

Franziska Horn¹

Franziska Biegler²

Tim Oppermann²

Klaus-Robert Müller¹,³,⁴

Abstract

More than ever, technical inventions are the symbol of our society’s advance. Patents guarantee their creators protection against infringement. For an invention to be patentable, its novelty and inventiveness have to be assessed. Therefore, a search for published work that describes inventions similar to a given patent application needs to be performed. Currently, this so-called search for prior art is executed with semi-automatically composed keyword queries, which is not only time-consuming but also prone to errors. In particular, errors may arise systematically from the fact that different keywords for the same technical concepts may exist across disciplines.

In this paper, a novel approach is proposed, where the full text of a given patent application is compared to existing patents using machine learning and natural language processing techniques to automatically detect inventions that are similar to the one described in the submitted document. Various state-of-the-art approaches for feature extraction and document comparison are evaluated. In addition to that, the quality of the current search process is assessed based on ratings of a domain expert. The evaluation results show that our automated approach, besides accelerating the search process, also improves the search results for prior art with respect to their quality.

Keywords: patents, search, prior art, information retrieval

¹ Machine Learning Group, Technische Universität Berlin, Berlin, Germany

² Pfenning, Meinig & Partner mbB, Berlin, Germany

³ Department of Brain and Cognitive Engineering, Korea University, Anam-dong, Seongbuk-gu, Seoul 02841, Korea

⁴ Max-Planck-Institut für Informatik, Saarbrücken, Germany

1 Introduction

A patent is the exclusive right to manufacture, use, or sell an invention and is granted by the government’s patent offices [54]. For a patent to be granted, it is indispensable that the described invention is not known or easily inferred from the so-called prior art, where prior art includes any written or oral publication available before the filing date of the submission. Therefore, for each application that is submitted, the responsible patent office performs a search for related work to check if the subject matter described in the submission is inventive enough to be patentable [54]. Before handing in the application to the patent office, the inventors will usually consult a patent attorney, who represents them in obtaining the patent. In order to assess the chances of the patent being granted, the patent attorney often also performs a search for prior art.

When searching for prior art, patent officers and patent attorneys currently rely mainly on simple keyword searches such as those implemented by the Espacenet tool from the European Patent Office, the TotalPatent software developed by LexisNexis, or the PatSnap patent search, all of which provide very limited semantic search options. These search engines often fail to return relevant documents, and because of constraints on the length of the entered search text, it is usually not possible to consider a patent application’s entire text for the search; instead, the database can only be queried for specific keywords.

Current search approaches for prior art therefore require a significant amount of manual work and time: given a patent application, the patent officer or attorney has to manually formulate a search query by combining words that should match documents describing similar inventions [5]. Furthermore, these queries often have to be adapted several times to optimize the output of the search [19, 66]. A main problem here is that regular keyword searches do not inherently take into account synonyms or more abstract terms related to the given query words. If an existing prior art document uses a synonym for an important term in the patent application, such as wire instead of cable, or a more specialized term, such as needle instead of sharp object, a keyword search might fail to reveal this relation unless the alternative term was explicitly included in the search query. This is relevant because patent texts commonly use very abstract and general terms to describe an invention in order to maximize the protective scope [63, 6]. A line of research [24, 4, 34, 32, 60] has focused on automatically expanding the manually composed queries, e.g., to take into account synonyms collected in a thesaurus [36, 34] or include keywords occurring in related patent documents [17, 38, 39]. Yet, with iteratively augmented queries, whether extended manually or automatically, the search for prior art remains a very time-consuming process.

Furthermore, a keyword-based search for prior art, even if done with the utmost professional care, will often produce suboptimal results (as we will see later in this paper and in Supporting Information D.2). With possibly imperfect queries, it must be assumed that relevant documents are missed in the search, leading to false negatives (FN). On the other hand, query words can also appear in texts that nonetheless cover quite different topics, which means the search will additionally yield many false positives (FP). When searching for prior art for a patent application, the consequences of false positives and false negatives are quite different. While false positives cause additional work for the patent examiner, who has to exclude the irrelevant documents from the report, false negatives may lead to an erroneous grant of a patent, which can have profound legal and financial implications for both the owner of said patent and its competitors [65].

1.1 An approach to automate the search for prior art

To overcome some of these disadvantages of current keyword-based search approaches, it is necessary to decrease the manual work and time required for conducting the search itself, while increasing the quality of the search results by preventing irrelevant patents from being returned and by automatically accounting for synonyms to reduce false negatives. This can be achieved by comparing the patent application with existing publications based on their entire texts rather than just searching for specific keywords. By considering the entire texts of the documents, much more information, including the context of keywords used within the respective documents, is taken into account. For humans it is of course infeasible to read the whole text of each possibly relevant document. Instead, state-of-the-art text processing techniques can be used for this task.

This paper describes a novel approach to automate the search for prior art with natural language processing (NLP) and machine learning (ML) techniques, such as neural network language models, in order to make it more efficient and accurate. The essence of this idea is illustrated in Fig 1. We first obtain a dataset of related patents from a patent database by using a few manually selected seed patents and then recursively adding the patents or patent applications that are cited by the documents already included in the dataset. The patent texts are then transformed into numerical feature vectors, based on which the similarity between two documents can be computed. We evaluate different similarity measures by comparing the prior art suggested by our automated approach to those documents that were originally cited in a patent’s search report and, in a second step, to documents considered relevant prior art for this patent by a patent attorney. By analyzing and comparing different approaches for computing full text similarities between patent documents, we aim to identify a similarity measure based on which it is possible to automatically and reliably select relevant prior art given, e.g., the draft of a new patent application.
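
The dataset construction described above, starting from seed patents and recursively adding cited documents, amounts to a graph traversal. The following is a minimal sketch under the assumption that citations are available as an in-memory mapping; `crawl_citations` and the toy identifiers are hypothetical, and a real crawl would query a patent database instead.

```python
from collections import deque


def crawl_citations(seeds, cited_by_doc):
    """Collect the seed documents plus everything reachable via citations."""
    collected = set(seeds)
    queue = deque(seeds)
    while queue:
        doc_id = queue.popleft()
        # Documents without a citation entry simply contribute nothing.
        for cited_id in cited_by_doc.get(doc_id, ()):
            if cited_id not in collected:
                collected.add(cited_id)
                queue.append(cited_id)
    return collected


toy_citations = {
    "P1": ["P2", "P3"],
    "P2": ["P4"],
    "P3": ["P2"],
    "P5": ["P1"],  # cites P1 but is not reachable from the seed
}
corpus = crawl_citations(["P1"], toy_citations)
print(sorted(corpus))
```

Tracking the already collected identifiers keeps the traversal finite even when citation chains overlap.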

The remainder of the paper is structured as follows: After briefly reviewing existing strategies for prior art search as well as machine learning methods for full text similarity search and its applications, we discuss our approach for computing the similarities between the patents using different feature extraction methods. These methods are then evaluated on an example dataset of patents including their citations, as well as a second dataset where relevant patents were identified by a patent attorney. Furthermore, based on this manually annotated dataset, we also assess the quality of the original citation process itself. A discussion of the relevance of the obtained results and a brief outlook conclude this manuscript.

1.2 Related work

Most research concerned with facilitating and improving the search for a patent’s prior art has focused on automatically composing and extending the search queries. For example, a manually formulated query can be improved by automatically including synonyms for the keywords using a thesaurus [36, 63, 34, 37, 70]. A potential drawback of such an approach, however, is that the thesaurus itself has to be manually curated and extended [72]. Another line of research focuses on pseudo-relevance feedback, where, given an initial search, the first $k$ search results are used to identify additional keywords that can be used to extend the original query [38, 18, 19]. Similarly, past queries [62] or meta data such as citations can be used to augment the search query [17, 39, 40]. A recent study has also examined the possibility of using the word2vec language model [44, 45, 46] to automatically identify relevant words in the search results that can be used to extend the query [61].

Approaches for automatically adapting and extending queries still require the patent examiner to manually formulate the initial search query. To make this step obsolete, heuristics can be used to automatically extract keywords from a given patent application [41, 25, 68] or a bag-of-words (BOW) approach can be used to transform the entire text of a patent into a list of words that can then be used to search for its prior art [67, 12, 71]. Oftentimes, partial patent applications, such as an extended abstract, may already suffice to conduct the search [12]. The search results can also be further refined with a graph-based ranking model [43] or by using the patents’ categories to filter the results [69]. Different prior art search approaches have previously been discussed and benchmarked within the CLEF-IP project, see e.g. [51] and [53].

In our approach, detailed in the following sections, we also reduce the work and time needed to manually compose a search query by simply operating on the patent application’s entire text. However, instead of only searching the database for relevant keywords extracted from this text, we transform the texts of all other documents into numerical feature representations as well, which allows us to compute the full text similarities between the patent application and its possible prior art.

Calculating the similarity between texts is at the heart of a wide range of information retrieval tasks, such as search engine development, question answering, document clustering, or corpus visualization. Approaches for computing text similarities can be divided into similarity measures relying on word similarities and those based on document feature vectors [20].

To compute the similarity between two texts using individual word similarities, the words in both texts first have to be aligned by creating word pairs based on semantic similarity and then these similarity scores are combined to yield a similarity measure for the whole text. Corley and Mihalcea [13] propose a text similarity measure, where the most similar word pairs in two texts are determined based on semantic word similarity measures as implemented in the WordNet similarity package [49]. The similarity score of two texts is then computed as the weighted and normalized sum of the single word pairs’ similarity scores. This approach can be further refined using greedy pairing [31]. Recently, instead of using WordNet relations to obtain word similarities, the similarity between semantically meaningful word embeddings, such as those created by the word2vec language model [44], was used. Kusner et al. [26] defined the word mover’s distance for computing the similarity between two sentences as the minimum distance the individual word embeddings have to move to match those of the other sentence. While similarity measures based on the semantic similarities of individual words are advantageous when comparing short texts, finding an optimal word pairing for longer texts is computationally very expensive and therefore these similarity measures are less practical in our setting, where the full texts of whole documents have to be compared.

To compute the similarity between longer documents, these can be transformed into numerical feature vectors, which serve as input to a similarity function. Rieck and Laskov [55] give a comprehensive overview of similarity measures for sequential data, some of which are widely used in information retrieval applications. Achananuparp et al. [3] test some of these similarity measures for comparing sentences on three corpora, using accuracy, precision, recall, and rejection as metrics to evaluate how many of the retrieved documents are relevant in relation to the number of relevant documents missed. Huang [23] uses several of these similarity measures to perform text clustering on tf-idf vectors. Interested in how well similarity measures reproduce human similarity ratings, Lee et al. [29] create a text similarity corpus based on all possible pairs of 50 different documents rated by 83 students. They test different feature extraction methods in combination with four of the similarity measures described in Rieck and Laskov [55] and calculate the correlation of the human ratings with the resulting scoring. They conclude that high precision can be achieved using the cosine similarity, while recall is still not satisfactory.

Full text similarity measures have previously been used to improve search results for MEDLINE articles, where a two-step approach using the cosine similarity measure between tf-idf vectors in combination with a sentence alignment algorithm yielded superior results compared to the boolean search strategy used by PubMed [30]. The Science Concierge [2] computes the similarities between papers’ abstracts to provide content-based recommendations; however, it still requires an initial keyword search to retrieve articles of interest. The PubVis web application by Horn [21], developed for visually exploring scientific corpora, also provides recommendations for similar articles given a submitted abstract by measuring overlapping terms in the document feature vectors. While full text similarity search approaches have shown potential in domains such as scientific literature, only a few studies have explored this approach for the much harder task of retrieving prior art for a new patent application [47], where much less overlap between text documents is to be expected due to the usage of very abstract and general terms when describing new inventions. Specifically, document representations created using recently developed neural network language models such as word2vec [44, 45, 22] or doc2vec [28] have not yet been evaluated on patent documents.

2 Methods

In order to study our hypothesis that the search for prior art can be improved by automatically determining, for a given patent application, the most similar documents contained in the database based on their full texts, we need to evaluate multiple approaches for comparing the patents’ full texts and computing similarities between the documents. To do this, we test multiple approaches for creating numerical feature representations from the documents’ raw texts, which can then be used as input to a similarity function to compute the documents’ similarity.

All raw documents first have to be preprocessed by lowercasing and removing non-alphanumeric characters. The simplest way of transforming texts into numerical vectors is to create high-dimensional but sparse bag-of-words (BOW) vectors with tf-idf features [42]. These BOW representations can also be reduced to their most expressive dimensions using dimensionality reduction methods such as latent semantic analysis (LSA) [27, 47] or kernel principal component analysis (KPCA) [59, 48, 57, 58]. Alternatively, the neural network language models (NNLM) [11] word2vec [44, 45] (combined with BOW vectors) or doc2vec [28] can be used to transform the documents into feature vectors. All these feature representations are described in detail in the Supporting Information A.1.
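
The preprocessing and tf-idf BOW steps can be sketched in a few lines. The exact weighting used in the paper is defined in the Supporting Information and is not reproduced here; the sketch below assumes raw term counts multiplied by $\log(N / \mathrm{df}(w))$ followed by length normalization, and the function names `preprocess` and `tfidf_vectors` are hypothetical.

```python
import math
import re
from collections import Counter


def preprocess(text):
    """Lowercase the text and split on runs of non-alphanumeric characters."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).split()


def tfidf_vectors(raw_docs):
    """Return one sparse {word: weight} dict per document, scaled to unit length."""
    tokenized = [preprocess(doc) for doc in raw_docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each word occur?
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Assumed weighting: term frequency times log(N / document frequency).
        vec = {w: tf * math.log(n_docs / doc_freq[w]) for w, tf in counts.items()}
        norm = math.sqrt(sum(v * v for v in vec.values()))
        vectors.append({w: v / norm for w, v in vec.items()} if norm > 0 else vec)
    return vectors


docs = [
    "A stent for keeping a blood vessel open.",
    "Dental brush with replaceable head.",
    "Coated stent; implantable into a vessel!",
]
vectors = tfidf_vectors(docs)
print(sorted(vectors[0].items(), key=lambda item: item[1], reverse=True)[:3])
```

Words that occur in only one document (such as "blood") receive a higher weight than words shared across documents (such as "stent"), which is the intended effect of the idf factor.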

Using any of these feature representations, the pairwise similarity between two documents’ feature vectors $\mathbf{x}_{i}$ and $\mathbf{x}_{j}$ can be calculated using the cosine similarity:

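$\mathrm{sim}(\mathbf{x}_{i}, \mathbf{x}_{j}) = \frac{\mathbf{x}_{i}^{T} \mathbf{x}_{j}}{\|\mathbf{x}_{i}\| \, \|\mathbf{x}_{j}\|},$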

which is $1$ for documents that are (almost) identical, and $0$ (in the case of non-negative BOW feature vectors) or below $0$ for unrelated documents [14, 23, 9]. Other possible similarity functions for comparing sequential data [55, 50] are discussed in the Supporting Information A.2.
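
A minimal sketch of the cosine similarity on sparse feature vectors, stored here as `{feature: value}` dictionaries, may make the two boundary cases concrete. The helper name, the toy vectors, and the convention of returning 0 for empty documents are assumptions made for illustration.

```python
import math


def cosine_similarity(x_i, x_j):
    """Cosine similarity of two sparse vectors given as {feature: value} dicts."""
    dot = sum(value * x_j.get(feature, 0.0) for feature, value in x_i.items())
    norm_i = math.sqrt(sum(v * v for v in x_i.values()))
    norm_j = math.sqrt(sum(v * v for v in x_j.values()))
    if norm_i == 0.0 or norm_j == 0.0:
        return 0.0  # convention chosen here for empty documents
    return dot / (norm_i * norm_j)


stent_a = {"stent": 2.0, "vessel": 1.0, "coating": 1.0}
stent_b = {"stent": 1.0, "vessel": 2.0}
dental = {"brush": 3.0, "dental": 1.0}

print(cosine_similarity(stent_a, stent_a))  # identical documents
print(cosine_similarity(stent_a, stent_b))  # partial overlap
print(cosine_similarity(stent_a, dental))   # no shared features
```

With non-negative features, documents that share no words have a dot product of zero, so their score is exactly 0, while partially overlapping documents fall strictly between 0 and 1.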

3 Data

Our experiments are conducted on two datasets, created using a multi-step process as briefly outlined here and further discussed in the Supporting Information B. For ease of notation, we use the term patent when really referring to either a granted patent or a patent application.

We first obtained a patent corpus containing more than 100,000 patent documents from the Cooperative Patent Classification scheme (CPC) category A61 (medical or veterinary science and hygiene), published between 2000 and 2015. From these documents, our first dataset was compiled, starting with the roughly 2,500 patents in the corpus published in 2015, which we will refer to as “target patents” in the remaining text. Each of the target patents cites on average 17.5 (standard deviation: $\pm$ 28.4) other patents in our corpus (i.e. published after 2000), which we also include in the dataset. Additionally, we randomly selected another 1,000 patents from the corpus, which were not cited by any of the selected target patents. This results in altogether 28,381 documents, which contain on average 13,530 ( $\pm$ 18,750) words. From these documents, the first dataset was then created by pairing up the patents and assigning each patent pair a corresponding label. Each target patent is paired up with a) all the patents it cites, with these pairs assigned the label ‘cited’, and b) the 1,000 patents not cited by any of the target patents, with these pairs labelled ‘random’. This first dataset consists of 2,470,736 patent pairs with a ‘cited/random’ labelling.
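
The pairing scheme for the first dataset can be sketched as follows. The function name and the toy identifiers are hypothetical, and the sketch assumes that the citation lists and the random patents have already been extracted from the corpus.

```python
def build_labelled_pairs(citations, random_ids):
    """Pair each target with its cited patents and with every random patent."""
    pairs = []
    for target_id, cited_ids in citations.items():
        pairs.extend((target_id, cited_id, "cited") for cited_id in cited_ids)
        pairs.extend((target_id, random_id, "random") for random_id in random_ids)
    return pairs


toy_citations = {"T1": ["C1", "C2"], "T2": ["C2"]}
toy_random = ["R1", "R2", "R3"]

pairs = build_labelled_pairs(toy_citations, toy_random)
n_cited = sum(1 for _, _, label in pairs if label == "cited")
n_random = sum(1 for _, _, label in pairs if label == "random")
print(len(pairs), n_cited, n_random)
```

Because every target is paired with the full set of random patents, the 'random' pairs vastly outnumber the 'cited' pairs, which is worth keeping in mind when interpreting the evaluation scores.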

The second dataset is created by obtaining additional, more consistent human labels from a patent attorney for a small subset of the first dataset. These labels should show which of the cited patents are truly relevant to the target patent and whether important prior art is missing from the search reports. For ten of the target patents, we selected their respective cited patents as well as several random patents that obtained either a relatively high, medium, or low similarity score as computed with the cosine similarity on tf-idf BOW features. These 450 patent pairs were then manually assigned ‘relevant/irrelevant’ labels and constitute our second dataset.

4 Evaluation

A pair of patents should have a high similarity score if the two texts address a similar or almost identical subject matter, and a low score if they are unrelated. Furthermore, if two patent documents address a similar subject matter, then one document of said pair should have been cited in the search report of the other. To evaluate the similarity computation with different feature representations, the task of finding similar patents can be modelled as a binary classification problem, where the samples correspond to pairs of patents. A patent pair is given a positive label, if one of the patents was cited by the other, and a negative label otherwise. We can then compute similarity scores for all pairs of patents and select a threshold for the score where we say all patent pairs with a similarity score higher than this threshold are relevant for each other while similarity scores below the threshold indicate the patents in this pair are unrelated. With a meaningful similarity measure, it should be possible to choose a threshold such that most patent pairs associated with a positive label have a similarity score above the threshold and the pairs with negative labels score below the threshold, i.e., the two similarity score distributions should be well separated. For a given threshold, we can compute the true positive rate (TPR), also called recall, and the false positive rate (FPR) of the similarity measure. By plotting the TPR against the FPR for different decision thresholds, we obtain the graph of the receiver operating characteristic (ROC) curve, where the area under the ROC curve (AUC) conveniently translates the performance of the similarity measure into a number between $0.5$ (similarity scores assigned to patent pairs with a ‘cited’ relationship and randomly paired patents are in the same range) and $1$ (semantically related patents receive consistently higher similarity scores than unrelated patent pairs). 
Further details on this performance measure can be found in the Supporting Information C.
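The AUC can also be read as the probability that a randomly drawn positive pair receives a higher similarity score than a randomly drawn negative pair, which makes it easy to compute directly from the scores. The following is a minimal sketch in plain Python; the function name `roc_auc` and the toy scores are illustrative assumptions and are not taken from the published code.

```python
def roc_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative score counts as half.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative pair")
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))


# Hypothetical similarity scores for six patent pairs (1 = cited, 0 = random).
toy_scores = [0.91, 0.75, 0.40, 0.62, 0.15, 0.08]
toy_labels = [1, 1, 1, 0, 0, 0]
auc = roc_auc(toy_scores, toy_labels)
```

In this toy example one cited pair (score 0.40) is ranked below one random pair (score 0.62), so eight of the nine positive/negative comparisons are correct.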

While the AUC is a very useful measure to select a similarity function based on which relevant and irrelevant patents can be reliably separated, the exact score also depends on characteristics of the dataset and may therefore seem overly optimistic [56]. Especially in our first dataset, many of the randomly selected patents contain little overlap with the target patents and can therefore be easily identified as irrelevant. With only a small fraction of the random pairs receiving a medium or high similarity score, this means that for most threshold values the FPR will be very low, resulting in larger AUC values. To give a further perspective on the performance of the compared similarity measures, we therefore additionally report the average precision (AP) score for the final results. For a specific threshold, precision is defined as the number of TP relative to the number of all returned documents, i.e., TP+FP. As we rank the patent pairs based on their similarity score, precision and recall can again be plotted against each other for $n$ different thresholds and the area under this curve can be computed as the weighted average of precision ( $P$ ) and recall ( $R$ ) for all $n$ threshold values [73]:

$$\text{AP}=\sum_{k=1}^{n}\left(R_{k}-R_{k-1}\right)P_{k}$$
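As a rough illustration, the average precision can be computed by walking down the ranked list and summing the precision at each rank, weighted by the increase in recall at that rank, i.e. $\sum_{k}(R_{k}-R_{k-1})P_{k}$ with $R_{0}=0$. The sketch below assumes that every rank position acts as one threshold and that all scores are distinct; the function name and the toy data are hypothetical.

```python
def average_precision(scores, labels):
    """Average precision: sum over ranks of (recall increase) * precision."""
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("need at least one positive pair")
    # Rank pairs from the highest to the lowest similarity score.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    true_positives = 0
    prev_recall = 0.0
    ap = 0.0
    for k, (_, label) in enumerate(ranked, start=1):
        true_positives += label
        precision = true_positives / k
        recall = true_positives / n_pos
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap


# Hypothetical similarity scores for six patent pairs (1 = cited, 0 = random).
toy_scores = [0.91, 0.75, 0.40, 0.62, 0.15, 0.08]
toy_labels = [1, 1, 1, 0, 0, 0]
ap = average_precision(toy_scores, toy_labels)
```

Only ranks at which a positive pair is found contribute to the sum, because recall does not change at the other ranks.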

5 Results

The aim of our study is to identify a robust approach for computing the full text similarity between two patents. To this end, in the following we evaluate different document feature representations and similarity functions by assessing how well the computed similarity scores are aligned with the labels of our two datasets, i.e., whether a high similarity score is assigned to pairs that are labelled as cited (relevant) and low similarity scores to random (irrelevant) pairs. Furthermore, we examine the discrepancies between patents cited in a patent application’s search report and truly relevant prior art. The data and code to replicate the experiments are available online (https://github.com/helmersl/patent_similarity_search).

5.1 Using full text similarity to identify cited patents

The similarities between the patents in each pair contained in the cited/random dataset are computed using the different feature extraction methods together with the cosine similarity and the obtained similarity scores are then evaluated by computing the AUC with respect to the pairs’ labels (Table 1). The similarity scores are computed using either the full texts of the patents to create the feature vectors, or only parts of the documents, such as the patents’ abstracts or their claims, to identify which sections are most relevant for this task [15, 12]. Additionally, the results on this dataset using BOW feature vectors together with other similarity measures can be found in the Supporting Information D.1.
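To make the scoring step concrete, the sketch below computes the cosine similarity between sparse feature vectors and ranks candidate patents for a target patent. It is a minimal illustration in plain Python: the dictionary-based vector format, the function names, and the toy weights are assumptions, not the representation used in the published code.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two sparse vectors given as {term: weight} dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(weight * weight for weight in u.values()))
    norm_v = math.sqrt(sum(weight * weight for weight in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_candidates(target, candidates):
    """Return (patent_id, score) tuples sorted from most to least similar."""
    scored = [(cosine_similarity(target, vec), pid) for pid, vec in candidates.items()]
    return [(pid, score) for score, pid in sorted(scored, reverse=True)]


# Hypothetical feature vectors for one target and two candidate patents.
target = {"bone": 1.0, "anchor": 1.0}
candidates = {
    "A": {"bone": 1.0, "anchor": 1.0, "rod": 1.0},
    "B": {"catheter": 1.0},
}
ranking = rank_candidates(target, candidates)
```

Candidate `A` shares two of its three terms with the target and is ranked first, whereas candidate `B` has no term in common and receives a score of zero.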

The BOW features outperform the tested dimensionality reduction methods LSA and KPCA as well as the NNLM word2vec and doc2vec when comparing the patents’ full texts (Table 1). Yet, with AUC values greater than 0.9, all methods succeed in identifying cited patents by assigning the patents found in a target patent’s search report a higher similarity score than those that they were paired up with randomly. When only certain patent sections are taken into account, the NNLMs perform as well as (word2vec) or even better than (doc2vec) the BOW vectors, and LSA performs well on the claims section as well. The comparably good performance, especially of doc2vec, on individual sections is probably due to the fact that these feature representations are more meaningful when computed for shorter texts, whereas when combining the embedding vectors of too many individual words, the resulting document representation can be rather noisy.

When looking more closely at the score distributions obtained with BOW features on the patents’ full texts as well as their claims sections (Fig 2), it can be seen that when only using the claims sections, the scores of the duplicate patent pairs, instead of being clustered near $1$ , range nearly uniformly between [math] and $1$ . This can be explained by divisional applications and the fact that during the different stages of a submission process, most of the time only the claims section is revised (usually by weakening the claims), such that several versions of a patent application will essentially differ from each other only in their claims whereas abstract and description remain largely unchanged [71, 12].

5.2 Identifying truly relevant patents

The search for prior art for a given patent application is in general conducted by a single person using mainly keyword searches, which might result in false positives as well as false negatives. Furthermore, as different patent applications are handled by different patent examiners, it is difficult to obtain a consistently labelled dataset. A more reliably labelled dataset would therefore be desirable to properly evaluate our automatic search approach. In the previous section, we showed that by computing the cosine similarity between feature vectors created from full patent texts we can identify patents that occur in the search report of a target patent. However, the question remains, whether these results translate to a real setting and if it is possible to find patents previously overlooked or prevent the citation of actually irrelevant patents.

To get an estimate of how many of the cited patents, as well as of the patents identified through our automated approach, are truly relevant for a given target patent, we asked a patent attorney to label a small subsample of the first dataset. As the patent attorney labelled these patents very carefully, her decisions merit a high confidence and we therefore consider them as the ground truth when her ratings are in conflict with the citation labels.

Using this second, more reliably labelled dataset, we first assess the amount of (dis)agreement between the cited/random labelling, based on the search reports, and the relevant/irrelevant labelling, obtained from the patent attorney. We then evaluate the similarity scores computed for this second dataset to see whether our automated approach is indeed capable of identifying the truly relevant prior art for a new patent application.

Comparing the current citation process to the additional human labels

To see if documents found in the search for prior art conducted by the patent office generally coincide with the documents considered relevant by our patent attorney, the confusion matrix as well as the correlation between the two human labellings is analysed. Please keep in mind that, in general, patent examiners can only assess the relevance of prior art that was actually found by the keyword driven search.

Taking the relevant/irrelevant labelling as the ground truth, the confusion matrix (Table 2) shows that 86 FP and 18 FN are produced by the patent examiner, which results in a recall of 0.78 and a precision score of 0.43. The large number of false positives can, in part, be explained by applicants being required by the USPTO to file so-called Information Disclosure Statements (IDS) including, according to the applicant, related background art [1]. The documents cited in an IDS are then included in the list of citations by the examiner, thus resulting in very long citation lists.

To get a better understanding of the relationship between the cosine similarity computed using BOW feature vectors and the relevant/irrelevant as well as the cited/random labelling, we calculate their pairwise correlations using Spearman’s $\rho$ (Table 3). The highest correlation score of 0.652 is reached between the relevant/irrelevant labelling and the cosine similarity, whereas Spearman’s $\rho$ for the cosine similarity and the cited/random labels is much lower (0.501).

When plotting the cosine similarity and the relevant/irrelevant labelling against each other for individual patents (e.g. Fig 3), in most cases, the scorings agree on whether a patent is relevant or not for the target patent. Yet it is worthwhile to inspect some of the outliers to get a better understanding of the process. In the Supporting Information D.2 we discuss two false positives, one produced by our approach and one found in a patent’s search report. More problematic, however, are false negatives, i.e., prior art that was missed when filing the application. For the target patent with ID US20150018885 our automated approach would have discovered a relevant patent, which was missed by the search performed by the patent examiner (Fig 3).

The patent with ID US20110087291 must be considered as relevant for the target patent, because both describe rigid bars that are aimed at connecting vertebrae for stabilization purposes with two anchors that are screwed into the bones. While in the target patent, the term bone anchoring member is used, the same part of the device in patent US20110087291 is called connecting member, which is a more abstract term. Moreover, instead of talking about a connecting bar, as it is done in the target patent, the term elongate fusion member is used in the other patent application.

Using full text similarity to identify relevant patents

In order to systematically assess how close the similarity score ranking can get to the one of the patent attorney (relevant/irrelevant) compared to the one of the patent office examiners (cited/random), the experiments performed on the first dataset with respect to the cited/random labelling were again conducted on this dataset subsample. For the analysis, it is important to bear in mind that this dataset is different from the one used in the previous experiments, as it only consists of the 450 patent pairs scored by the patent attorney. For each of the feature extraction methods, it was assessed how well the cosine similarity could distinguish between the relevant and irrelevant as well as the cited and random patent pairs of this smaller dataset.

The AUC and AP values achieved with the different feature representations on both labellings as well as, for comparison, on the original dataset, are reported in Table 4.

On this dataset subsample, the AUC w.r.t. the cited/random labelling is much lower than in the previous experiment on the larger dataset (0.806 compared to 0.956 for BOW features), which can be in part explained by the varying number of easily identifiable negative samples and their impact on the FPR: The full cited/random dataset contains many more low-scored random patents than the relevant/irrelevant subsample, where we included an equal amount of low- and high-scored random patents for each of the ten target patents. Yet, for most feature representations, the performance is better for the relevant/irrelevant than for the cited/random labelling of the dataset subsample, and the best results on the relevant/irrelevant labelling are achieved using the combination of BOW vectors and word2vec embeddings as feature vectors.

6 Discussion

The search for prior art for a given patent application is currently based on a manually conducted keyword search, which is not only time consuming but also prone to mistakes yielding both false positives and, more problematically, false negatives. In this paper, an approach for automating the search for prior art was developed, where a patent application’s full text is automatically compared to the patents contained in a database, yielding a similarity score based on which the patents can be ranked from most similar to least similar. The patents whose similarity scores exceed a certain threshold can then be suggested as prior art.

Several feature extraction methods for transforming documents into numerical vectors were evaluated on a dataset consisting of several thousand patent documents. In a first step, the evaluation was performed with respect to the distinction between cited and random patents, where cited patents are those included in the given target patent’s search report and random patents are randomly selected patent documents that were not cited by any of the target patents. We showed that by computing the cosine similarity between feature vectors created from full patent texts, we can reliably identify patents that occur in the search report of a target patent. The best distinction between these cited and random patents on the full corpus could be achieved when computing the cosine similarity using the well-established tf-idf BOW features, which is conceptually the method most closely related to a regular keyword search.

To examine the discrepancies between the computed similarity scores and cited/random labels, we obtained additional and more reliable labels from a patent attorney to identify truly relevant patents. As illustrated by Tables 3 and 4, the automatically calculated similarities between patents are closer to the patent attorney’s relevancy scoring than to the cited/random labellings obtained from the search report. The comparison of different feature representations on the smaller dataset not only showed that the same feature extraction method reaches different AUCs for the two labellings, but also that the feature extraction method that best distinguishes between cited and random patents on the full corpus (BOW) was outperformed on the relevant/irrelevant dataset by the combination of tf-idf BOW feature vectors with word2vec embeddings. This again indicates that the keyword search is missing patents that use synonyms or more general and abstract terms, which can be identified using the semantically meaningful representations learned by a NNLM. Therefore, with our automated similarity search, we are able to identify the truly relevant documents for a given patent application.

Most importantly, we gave an example where the cosine similarity caught a relevant patent originally missed by the patent examiner (Fig 3). As discussed at the beginning of this paper, missing a relevant prior art document in the search is a serious issue, as this might lead to an erroneous grant of a patent with profound legal and financial implications for both the applicant as well as competitors.

Consequently, our findings show that the search for prior art for a given patent application, and thereby the citation process, can be greatly enhanced by a precursory similarity scoring of the patents based on their full texts. With our NLP based approach we would not only greatly accelerate the search process, but, as shown in our empirical analysis, our method could also improve the quality of the results by reducing the number of omitted yet relevant documents.

Given the so far unsatisfying precision (0.43) and recall (0.78) values of the standard citation process compared to the relevancy labellings provided by our patent attorney, in the future it is clearly desirable to focus on improving the separation of relevant and irrelevant instead of cited and random patents. Our results on the small relevant/irrelevant dataset, while very encouraging, should only be considered as a first indicative step; clearly the creation of a larger dataset, reliably labelled by several experts, will be an essential next step for any further evaluation.

While we have demonstrated that our search approach is capable of identifying FP and FN w.r.t. the documents cited in a patent’s original search report, it is not clear whether this original search for prior art was always conducted using any of the more sophisticated IR approaches discussed in the related works section at the beginning of the paper, i.e., going beyond a basic manual keyword search. Therefore, a future step in the evaluation of our search approach would be to benchmark our methods against these existing IR techniques specifically developed for the prior art search, for example, using the CLEF-IP datasets [51, 53].

Furthermore, the methods discussed within this paper should also be applied to documents from other CPC classes to assess the quality of the automatically generated search results in domains other than medical or veterinary science and hygiene. Additionally considering the (sub)categories of the patents as features when conducting the search for prior art also seems like a promising step to further enhance the search results [69, 35].

It should also be evaluated how well these results translate to patents filed in other countries [52, 33], especially if these patents were automatically translated using machine translation methods [64, 16]. Here it may also be important to take a closer look at similarity search results obtained by using only the texts from single patent sections. As related work has shown [12, 15], an extended abstract and description may often suffice to find prior art. This can speed up the patent filing process, as all relevant prior art can already be identified early in the patent application process, thereby reducing the number of duplicate submissions with only revised (i.e. weakened) claims. However, as patents filed in different countries have different structures, these results might not directly translate to, e.g., patents filed with the European Patent Office.

It might also be of interest to compare other NNLM based feature representations for this task, e.g., by combining the word2vec embeddings with a convolutional neural network [7, 8]. To better adapt a similarity search approach to patents from other domains, it could also be advantageous to additionally take into account image based similarities computed from the sketches supplied in the patent documents [5, 32].

A further important challenge is how an exhaustive comparison of a given patent application to all the millions of documents contained in a real world patent database could be performed efficiently. Promising approaches for speeding up the similarity search for all pairs in a set [10] should be explored for this task in future work.

The search for a patent’s prior art is a particularly difficult problem, as patent applications are purposefully written to create little overlap with other patents, since a patent application only has a chance of being granted if the invention is distinguished from others [6]. As our automated full text similarity search approach successfully improves the search for a patent’s prior art, these methods are also promising candidates for enhancing other document searches, such as identifying relevant scientific literature.

Acknowledgements

This work was supported by the Federal Ministry of Education and Research (BMBF) for the Berlin Big Data Center BBDC (01IS14013A) and Berlin Center for Machine Learning BZML (01IS18037I), as well as the Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (No. 2017-0-00451). Pfenning, Meinig & Partner mbB provided support in the form of salaries for authors FB and TO. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Author contributions statement

FH, FB, and KRM discussed and conceived the experiments, LH conducted the experiments, FB and TO labelled the subsample of the dataset. All authors wrote and reviewed the manuscript. Correspondence to LH, FH, and KRM.

Appendix A Supporting Information: Methods

A.1 Feature representations of text documents

Tf-Idf BOW features

Given $D$ documents with a vocabulary of size $L$ , each text is transformed into a bag-of-words (BOW) feature vector $\mathbf{x}_{k}\in\mathbb{R}^{L}\;\forall k\in 1...D$ by first computing a normalized count, the term frequency (tf), for each word in a text, and then weighting this by the word’s inverse document frequency (idf) to reduce the influence of very frequent but inexpressive words that occur in almost all documents (such as ‘and’ and ‘the’) [42]. The idf of a term $w$ is calculated as the logarithm of the total number of documents, $|D|$ , divided by the number of documents that contain term $w$ , i.e.

$$\text{idf}(w)=\log\frac{|D|}{|\{d\in D:w\in d\}|}$$

The entry corresponding to the word $w$ in the feature vector $\mathbf{x}_{k}$ of a document $k$ is then

$$x_{kw}=\text{tf}_{k}(w)\cdot\text{idf}(w)$$

Instead of using the term frequency, a binary entry in the feature vector for each word occurring in the text might often suffice. Furthermore, the final tf-idf vectors can be normalized by dividing them e.g. by the maximum or the length of the respective vector:

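$$\mathbf{x}_{k}^{\max}=\frac{\mathbf{x}_{k}}{\max(\mathbf{x}_{k})}\quad\text{or}\quad\mathbf{x}_{k}^{\ell_{2}}=\frac{\mathbf{x}_{k}}{\|\mathbf{x}_{k}\|}$$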
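The construction of the tf-idf vectors can be sketched in a few lines of plain Python. The sketch assumes that the term frequency is the raw count divided by the document length and that the final vectors are normalized to unit Euclidean length; both are illustrative choices among the variants described above, and the function name and toy documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(tokenized_docs):
    """Return unit-length tf-idf vectors ({term: weight} dicts) and the idf table."""
    n_docs = len(tokenized_docs)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))
    idf = {term: math.log(n_docs / df) for term, df in doc_freq.items()}

    vectors = []
    for tokens in tokenized_docs:
        counts = Counter(tokens)
        # Assumed tf: count normalized by the document length.
        vec = {term: (c / len(tokens)) * idf[term] for term, c in counts.items()}
        norm = math.sqrt(sum(x * x for x in vec.values()))
        if norm > 0.0:
            vec = {term: x / norm for term, x in vec.items()}
        vectors.append(vec)
    return vectors, idf


# Hypothetical, already tokenized toy documents.
docs = [
    ["the", "bone", "screw"],
    ["the", "bone", "anchor"],
    ["the", "catheter"],
]
vectors, idf = tfidf_vectors(docs)
```

Because 'the' occurs in every toy document, its idf is $\log(3/3)=0$ and it no longer contributes to any of the document vectors.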

LSA and KPCA

Transforming the documents in the corpus into BOW vectors leads to a high-dimensional but sparse feature matrix. These feature representations can be reduced to their most expressive dimensions, which helps to reduce noise in the data and create more overlap between vectors. For this, we experiment with both latent semantic analysis (LSA) [27] and kernel principal component analysis (KPCA) [59].

LSA represents a word’s meaning as the average of all the passages the word appears in, and a passage, such as a document, as the average of all the words it contains. Mathematically, a singular value decomposition (SVD) of the BOW feature matrix $X\in\mathbb{R}^{D\times L}$ for the respective corpus is performed. The original data points can then be projected onto the vectors corresponding to the $l$ largest singular values of matrix $X$ , yielding a lower-dimensional representation $\hat{X}\in\mathbb{R}^{D\times l}$ , where $l<L$ . Choosing a dimensionality $l$ that is smaller than the original dimension $L$ is assumed to lead to a deeper abstraction of words and word sequences and to give a better approximation of their meaning [27].

Similarly, KPCA [59, 48] performs an SVD of a linear or non-linear kernel matrix $K\in\mathbb{R}^{D\times D}$ to obtain a low dimensional representation of the data, again based on the eigenvectors corresponding to the largest eigenvalues of this matrix. While we have studied different Gaussian kernels, we found that good results could already be obtained using the linear kernel $K=XX^{\top}$ .
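The close relationship between LSA and KPCA with a linear kernel can be illustrated with NumPy. The sketch below is an assumption-laden toy version: it uses a small random matrix in place of a real BOW matrix, and it omits the kernel centring that standard KPCA includes, so that the correspondence between the two projections is directly visible. The function names are hypothetical.

```python
import numpy as np


def lsa_embed(X, l):
    """Project the rows of X onto the l leading right singular vectors."""
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return X @ Vt[:l].T  # shape: D x l


def linear_kpca_embed(X, l):
    """Embed via the l leading eigenvectors of the linear kernel K = X X^T.

    Kernel centring is deliberately omitted in this simplified sketch.
    """
    K = X @ X.T
    eigvals, eigvecs = np.linalg.eigh(K)  # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:l]
    return eigvecs[:, top] * np.sqrt(np.maximum(eigvals[top], 0.0))


# Toy stand-in for a BOW matrix: D = 5 documents, L = 8 vocabulary terms.
rng = np.random.default_rng(0)
X = rng.random((5, 8))
Z_lsa = lsa_embed(X, 2)
Z_kpca = linear_kpca_embed(X, 2)
```

Without centring, both functions return the same low-dimensional coordinates up to the sign of each dimension, since the eigenvectors of $XX^{\top}$ are the left singular vectors of $X$.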

When reducing the dimensionality of the BOW feature vectors with LSA and KPCA, four embedding dimensions (100, 250, 500 and 1000) were tested and the best performance on the full texts was achieved using 1000 dimensions. As the dataset subsample contains only 450 patent pairs, here the best results with LSA and KPCA were achieved using only 100 dimensions.

Combining BOW features with word2vec embeddings

One shortcoming of the BOW vectors is that semantic relationships between words, such as synonymy, as well as word order, are not taken into account. This is due to the fact that each word is associated with a single dimension in the feature vector and therefore the distances between all words are equal. The aspect of synonymy is especially relevant for patent texts, where very abstract and general terms are used for describing an invention in order to assure a maximum degree of coverage. For instance, a term like fastener might be preferred over the usage of the term screw, as it includes a wider range of material and therefore gives a better protection against infringement. Thus, patent texts tend to contain neologisms and abstract words that might even be unique in the corpus. To account for this variety in a keyword search is especially tedious and prone to errors as the examiner has to search for synonyms at different levels of abstraction or rely on a thesaurus, which would then need to be kept up-to-date [72]. Even the BOW approach could in this case only capture the similarity between the patent texts if there is overlap between the words in the context around a synonym. Neural network language models (NNLM) [11] are an approach specifically developed to overcome these restrictions; they aim at representing words or documents by semantically meaningful vectorial embeddings.

A NNLM that recently received a lot of attention is word2vec. Its purpose is to embed words in a vector space based on their contexts, such that terms appearing in similar contexts are close to each other in the embedding space w.r.t. the cosine similarity [44, 45, 22]. Given a text corpus, the word representations are obtained by training a neural network that learns from the local contexts of the input words in the corpus. The embedding is then given by the learned weight matrix. Mikolov et al. [44] describe two different network architectures for training the word2vec model, namely the continuous bag-of-words (CBOW) and the skip-gram model. The first one learns word representations by predicting a target word based on its context words and the latter one by predicting the context words for the current input word. As the skip-gram model showed better performance in analogy tasks [44, 45, 46], it is used in this paper. (Analogy tasks aim at finding relations such as A is to B as C is to $\_\_$. For instance, in the relation good is to better as bad is to $\_\_$, the correct answer would be worse.)

To make use of the information learned by the word2vec model for each word in the corpus vocabulary $L$ , the trained word embeddings have to be combined to create a document vector for each patent text. To this end, the dot product of each document’s BOW vector with the word embedding matrix $W\in\mathbb{R}^{L\times r}$ , containing one $r$ -dimensional word embedding per row, is calculated. For each document represented by a BOW vector $\mathbf{x}_{k}\in\mathbb{R}^{L}$ , this results in a new document vector $\tilde{\mathbf{x}}_{k}\in\mathbb{R}^{r}$ , which corresponds to the sum of the word2vec embeddings of the terms occurring in the document, weighted by their respective tf-idf scores. Combining the BOW vectors and the word embeddings thus comes along with a dimensionality reduction of the document vectors, while their sparseness is lost.
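The combination step amounts to a single matrix product, which the following NumPy sketch illustrates. The three-word vocabulary, the two-dimensional embedding matrix, and the document vectors are made up for illustration only; in particular, the embeddings of 'screw' and 'fastener' are simply assumed to be close to each other.

```python
import numpy as np


def combine_bow_with_embeddings(X_bow, W):
    """Map D x L tf-idf vectors to D x r vectors: tf-idf weighted sums of embeddings."""
    return X_bow @ W


def cosine(u, v):
    """Cosine similarity of two dense vectors (0.0 if either is all zeros)."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom > 0.0 else 0.0


# Hypothetical vocabulary: ["screw", "fastener", "bone"], embedding dimension r = 2.
W = np.array([
    [1.0, 0.1],  # screw
    [0.9, 0.2],  # fastener (assumed to lie close to "screw")
    [0.0, 1.0],  # bone
])
# Two toy documents that use different words for the same concept.
X_bow = np.array([
    [1.0, 0.0, 0.0],  # document 1 only mentions "screw"
    [0.0, 1.0, 0.0],  # document 2 only mentions "fastener"
])
X_emb = combine_bow_with_embeddings(X_bow, W)
bow_similarity = cosine(X_bow[0], X_bow[1])
emb_similarity = cosine(X_emb[0], X_emb[1])
```

The two toy documents share no words, so their BOW cosine similarity is zero, while the combined representation assigns them a high similarity because the two synonyms have similar embeddings.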

For the word2vec model we use a standard setting from the literature (i.e. the embedding dimension $r$ was set to $200$ , the window size $c$ as well as the minimum frequency to 5 and negative sampling was performed using 13 noise words) [44, 45].

Doc2vec representations

With doc2vec, Le and Mikolov [28] extend the word2vec model to directly represent word sequences of arbitrary lengths, such as sentences, paragraphs or even whole documents, by vectors. To learn the representations, word and paragraph vectors are trained simultaneously for predicting the next word for different contexts of fixed size sampled from the paragraph such that, at least in small contexts, word order is taken into account. Words are mapped to a unique embedding in a matrix $W\in\mathbb{R}^{L\times r}$ and paragraphs to a unique embedding in a matrix $P\in\mathbb{R}^{D\times r}$ . In each training step, paragraph and word embeddings are combined by concatenation to predict the next word given a context sampled from the respective paragraph. After training, the doc2vec model can be used to infer the embedding for an unseen document by performing gradient descent on the document matrix $P$ after having added more rows to it and holding the learned word embeddings and softmax weights fixed [28].

For the doc2vec model, we explored the parameter values 50, 100, 200 and 500 for the embedding dimension $r$ of the document vectors on the cited/random dataset in preliminary experiments, with the best results achieved with $r=50$ . The window size was set to 8, the minimum word count to 5, and the model was trained for 18 iterations. When training the model, the target patents were excluded from the corpus to avoid overfitting. Their document vectors were then inferred by the model given the learned parameters before computing the similarities to the other patents.

A.2 Functions for measuring similarity between text documents

Transforming the patent documents into numeric feature vectors allows us to assess their similarity with the help of mathematical functions. Rieck and Laskov [55] give a comprehensive overview of vectorial similarity measures for the pairwise comparison of sequential data. These can be divided into three main categories, namely kernels, distance functions, and similarity coefficients. Their formulas are shown in Table 5 and the notation is consistent with the one in the paper. Here, $w$ corresponds to a word in the vocabulary $L$ of the corpus, and $\Phi_{w}\left(x\right)$ maps each word $w\in L$ to its normalized and weighted count in sequence $x$ , i.e. to its tf-idf value. The similarity functions will be briefly described in the following, while further details can be found in the original publication [55].

The general idea for the comparison of two sequences is that the more overlap they show with respect to their subsequences, the more similar they are. When transforming texts into BOW features, a subsequence corresponds to a single word. Two sequences $x_{i}$ and $x_{j}$ can thus be compared based on the normalized and weighted counts of the subsequences stored in the respective feature vectors $\mathbf{x}_{i}$ and $\mathbf{x}_{j}$ .

Kernel functions

The first group of similarity measures Rieck and Laskov [55] discuss are kernel functions. They implicitly map the feature vectors into a possibly high or even infinite dimensional feature space, where the kernel can be expressed as a dot product. A kernel $k$ thus has the general form

$$k(\mathbf{x}_{i},\mathbf{x}_{j})=\langle f(\mathbf{x}_{i}),f(\mathbf{x}_{j})\rangle$$

where $f$ maps the vectors into the kernel feature space. The advantage of the kernel function is that it avoids the explicit calculation of the vectors’ high dimensional mapping and instead allows the result to be obtained in terms of the vectors’ representation in the input space [58, 57].
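The equivalence between a kernel evaluated in the input space and a dot product in the feature space can be checked numerically for a simple case. The sketch below uses the homogeneous quadratic kernel $k(\mathbf{x},\mathbf{y})=(\mathbf{x}^{\top}\mathbf{y})^{2}$, whose explicit feature map consists of all pairwise products of the input components. This kernel is chosen purely for illustration and is not necessarily one of the kernels evaluated in the paper; the function names and input vectors are hypothetical.

```python
from itertools import product


def dot(u, v):
    """Standard dot product of two equally long sequences."""
    return sum(a * b for a, b in zip(u, v))


def quadratic_kernel(x, y):
    """Kernel computed in the input space: (x . y)^2."""
    return dot(x, y) ** 2


def explicit_map(x):
    """Explicit feature map f(x): all pairwise products x_i * x_j."""
    return [a * b for a, b in product(x, repeat=2)]


x = [1.0, 2.0, 3.0]
y = [0.5, -1.0, 2.0]
kernel_value = quadratic_kernel(x, y)
feature_space_value = dot(explicit_map(x), explicit_map(y))
```

Both values agree, but the kernel needs only one dot product of length 3, whereas the explicit map has 9 dimensions; for a vocabulary of size $L$ the explicit map would already have $L^{2}$ dimensions.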

Distance functions

The distance functions described in Rieck and Laskov [55] are so-called bin-to-bin distances [50]. This means that they compare each component of the vector to its corresponding component in the other one, e.g. by subtracting the respective word counts and summing the subtractions for all words in the vocabulary. Unlike similarity measures, the distance measures are higher the more different the compared sequences are but can be easily transformed into a similarity measure by multiplying the result with $-1$ , for example.
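A bin-to-bin distance and its conversion into a similarity score can be sketched as follows. The Manhattan distance, which sums the absolute component-wise differences, serves as the example here; the dictionary-based vectors, the function names, and the toy weights are illustrative assumptions.

```python
def manhattan_distance(u, v):
    """Sum of absolute component-wise differences of two {term: weight} dicts."""
    terms = set(u) | set(v)
    return sum(abs(u.get(t, 0.0) - v.get(t, 0.0)) for t in terms)


def distance_to_similarity(distance):
    """Turn a distance into a similarity by multiplying it with -1."""
    return -distance


# Hypothetical tf-idf style vectors of two documents.
doc_a = {"bone": 0.5, "anchor": 0.5}
doc_b = {"bone": 0.5, "rod": 0.25}
distance = manhattan_distance(doc_a, doc_b)
similarity = distance_to_similarity(distance)
```

After the sign flip, identical documents obtain the highest possible score of zero and increasingly different documents obtain increasingly negative scores, so that the pairs can be ranked in the same way as with a similarity measure.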

Similarity coefficients

Similarity coefficients were designed for the comparison of binary vectors and, instead of expressing metric properties, they assess similarity by comparing the number of matching components between two sequences. More precisely, for calculating the similarity of two sequences $x_{i}$ and $x_{j}$ , they use three variables a, b and c, where a corresponds to the number of components contained in both $x_{i}$ and $x_{j}$ , b to the number of components contained in $x_{i}$ but not in $x_{j}$ , and c to the number of components contained in $x_{j}$ but not in $x_{i}$ . In the case of BOW vectors, which are not inherently binary, the three variables can be expressed as follows:

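$$a=\sum_{w\in L}\min\left(\Phi_{w}(x_{i}),\Phi_{w}(x_{j})\right),\quad b=\sum_{w\in L}\left[\Phi_{w}(x_{i})-\min\left(\Phi_{w}(x_{i}),\Phi_{w}(x_{j})\right)\right],\quad c=\sum_{w\in L}\left[\Phi_{w}(x_{j})-\min\left(\Phi_{w}(x_{i}),\Phi_{w}(x_{j})\right)\right]$$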
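As an illustration, the sketch below computes the three variables for non-binary vectors under the assumption that the matching mass $a$ is the sum of the component-wise minima, with $b$ and $c$ being the remaining mass of the first and second vector, respectively. It then evaluates the Jaccard coefficient $a/(a+b+c)$ as one example of a similarity coefficient. The function names and toy vectors are hypothetical.

```python
def match_counts(u, v):
    """Return (a, b, c) for two {term: weight} dicts with non-negative weights.

    a: mass shared by both vectors (sum of component-wise minima, assumed here),
    b: mass present only in u, c: mass present only in v.
    """
    terms = set(u) | set(v)
    a = sum(min(u.get(t, 0.0), v.get(t, 0.0)) for t in terms)
    b = sum(u.values()) - a
    c = sum(v.values()) - a
    return a, b, c


def jaccard(u, v):
    """Jaccard coefficient a / (a + b + c); 0.0 if both vectors are empty."""
    a, b, c = match_counts(u, v)
    denom = a + b + c
    return a / denom if denom else 0.0


# Binary toy vectors: the coefficient reduces to the usual set-based Jaccard index.
doc_a = {"bone": 1.0, "anchor": 1.0}
doc_b = {"anchor": 1.0, "rod": 1.0}
a, b, c = match_counts(doc_a, doc_b)
score = jaccard(doc_a, doc_b)
```

For binary vectors the three variables are plain counts (here one shared word and one word unique to each document), so the generalized definition is consistent with the original binary case.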

Appendix B Supporting Information: Data

Patent corpus

To evaluate the different methods for computing document similarities on real world data, an initial patent corpus was obtained from a patent database. This corpus consists of over 100,000 patent grants and applications published at the United States Patent and Trademark Office (USPTO) between 2000 and 2015.

We create such a patent corpus (by crawling Google Patents, https://www.google.de/patents) as illustrated in Fig 4. To get a more homogeneous dataset, only patents of the category A61 (medical or veterinary science and hygiene) according to the Cooperative Patent Classification scheme (CPC) were included in our corpus. Another important criterion for including a patent document in our initial patent corpus was that its search report, i.e. the prior art cited by the examiner, had to be available from the database. Starting with 20 manually selected (randomly chosen) seed patents published in 2015, the patent corpus was iteratively extended by including the seed patents’ citations if they were published after 1999 and belonged to the category A61. The citations of these patents were then again checked for publication year and category and included if they fulfilled the respective conditions.
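The iterative extension of the corpus corresponds to a breadth-first traversal of the citation graph with a filter on publication year and category. The sketch below illustrates this logic on an in-memory toy database; the `fetch_patent` callback, the record fields, and all patent IDs are hypothetical placeholders and do not reflect the actual crawler.

```python
from collections import deque


def crawl_corpus(seed_ids, fetch_patent, min_year=2000, cpc_prefix="A61"):
    """Collect patents reachable via citations that pass the year/category filter."""
    corpus = {}
    seen = set(seed_ids)
    queue = deque(seed_ids)
    while queue:
        pid = queue.popleft()
        patent = fetch_patent(pid)
        # Skip patents that are unavailable or have no search report.
        if patent is None or patent.get("citations") is None:
            continue
        if patent["year"] < min_year or not patent["cpc"].startswith(cpc_prefix):
            continue
        corpus[pid] = patent
        # Only the citations of included patents are followed further.
        for cited in patent["citations"]:
            if cited not in seen:
                seen.add(cited)
                queue.append(cited)
    return corpus


# Hypothetical toy database standing in for the patent source.
toy_db = {
    "S1": {"year": 2015, "cpc": "A61B", "citations": ["P1", "P2", "P3"]},
    "P1": {"year": 2003, "cpc": "A61K", "citations": ["P4"]},
    "P2": {"year": 1998, "cpc": "A61B", "citations": []},  # too old
    "P3": {"year": 2005, "cpc": "G06F", "citations": []},  # wrong category
    "P4": {"year": 2001, "cpc": "A61B", "citations": []},
}
corpus = crawl_corpus(["S1"], toy_db.get)
```

Since citations always point to earlier publications, such a traversal moves backwards in time from the seed patents.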

Structure of the crawled dataset

Comparing the distribution of patents published per year in the dataset with the total number of patents filed between 2000 and 2015 at the USPTO (Fig 5), it can be seen that the distribution in the dataset is not representative. The peak in 2003 and the fact that there are fewer and fewer patents with a publication date in the following years are most probably a result of the crawling strategy: we started with some patents filed in 2015 and then successively crawled their citations, which were published in the past, and this explains the low number of patents published in more recent years in the dataset.

The same holds for the subcategory distribution displayed in Fig 6. While the most prominent subcategory in our dataset is A61B, the most frequent subcategory is actually A61K. The bias for subcategory A61B is due to the fact that several seed patents belonged to it.

Finally, to get some insights into the existing search for prior art, we examine the distribution of the number of citations in the patent dataset. The citation counts for a subsample of 5000 randomly selected patents show that the distribution follows Zipf’s law with many patents having very few citations and a low number of patents having many citations (Fig 7).

Structure of a patent

The requirements regarding the structure of a patent application are very strict and prescribe the presence of certain sections and what their content should be. For the automated comparison of texts it can be interesting to have a closer look at the different sections of the documents as it might, for instance, be sufficient to only compare a specific section of the texts. This can on the one hand be useful to perform a preliminary search for prior art before the patent text is written in its entirety in order to prevent unnecessary work and on the other hand, it can help to decrease the computational burden of preprocessing and comparing full texts.

The Patent Cooperation Treaty (PCT) by the World Intellectual Property Organization (WIPO) defines several obligatory sections a patent application must contain. (The WIPO is an agency of the United Nations with the aim of unifying and fostering the protection of intellectual property.) According to their requirements, a patent application should consist of a title, an abstract, the claims, and the description, where the invention is thoroughly described and the figures included in the document are explained in depth. Similar to scientific publications, a patent’s abstract consists of a short summary of what the invention is about. The claims section plays a very special role in a patent application, as it defines the extent of the protection the patent should guarantee for the invention and is therefore the section the patent attorneys and patent officers base their search for prior art on. If the claims come into conflict with already existing publications, they can be edited by weakening the protection requirements, which is why this section is reformulated the most during the possibly multiple stages of a patent process.

As both the USPTO and the European Patent Office (EPO) adopt the PCT, the required sections are the same in the United States and in Europe. Nonetheless, some differences in the length of the description section can be observed. For a patent application handed in at the USPTO, this section mostly consists of the figures’ descriptions, while for applications to the EPO it contains more abstract descriptions of the invention itself. This is due to stricter requirements of consistency between claims and description for European patents and must be taken into account when patents filed at different offices are compared, as this might result in lower similarity scores [52, 33].

Constructing a labelled dataset with cited and random patents

A first labelled dataset was constructed from the patent corpus by pairing up the patents and labelling each pair depending on whether or not one patent in the pair is cited by the other. More formally, let $\mathcal{P}$ be the set of patents in the corpus and $\mathcal{P}^{2}$ its Cartesian product with itself. Each patent pair $(p_{1},p_{2})\in\mathcal{P}^{2}$ then gets assigned the label $1$ (cited) if $p_{2}$ is contained in the search report of patent $p_{1}$ and $0$ (random) otherwise. As some of the tested approaches are computationally expensive, we did not pair up all of the 100,000 documents in the corpus. Instead, the roughly 2,500 patents published in 2015 contained in the corpus were selected as a set of target patents and paired up with their respective citations as well as with a set of 1,000 randomly selected patents that were not contained in the search reports of any of the target patents.
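The pairing scheme can be sketched as follows. The function names, the fixed random seed and the toy identifiers are assumptions for this example; the label values 1 (cited) and 0 (random) follow the definition above.

```python
import random

def select_random_pool(corpus, targets, search_reports, k, seed=0):
    """Sample k patents that are neither targets nor cited by any target."""
    cited_by_any = set().union(*(search_reports.get(t, set()) for t in targets))
    candidates = sorted(set(corpus) - set(targets) - cited_by_any)
    return random.Random(seed).sample(candidates, k)

def build_labelled_pairs(targets, search_reports, random_pool):
    """Return (target, other, label) triples: 1 = cited, 0 = random."""
    pairs = []
    for target in targets:
        for cited in sorted(search_reports.get(target, set())):
            pairs.append((target, cited, 1))
        for other in random_pool:
            pairs.append((target, other, 0))
    return pairs

# Hypothetical toy corpus with two target patents.
corpus = ["T1", "T2", "C1", "C2", "C3", "R1", "R2", "R3", "R4"]
targets = ["T1", "T2"]
search_reports = {"T1": {"C1", "C2"}, "T2": {"C2", "C3"}}

pool = select_random_pool(corpus, targets, search_reports, k=3)
pairs = build_labelled_pairs(targets, search_reports, pool)
```

With a shared pool of 1,000 random patents per target, the number of random pairs grows as 1,000 times the number of target patents, while the number of cited pairs depends on the length of each search report.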

Due to divisional applications and parallel filings and because claims are often changed during the application process, patents with the same description may appear several times with different IDs, which is why, as a sanity check, duplicates for some of the target patents were included in the dataset as well (duplicates are expected to receive a similarity score near or equal to 1). Altogether, this ‘cited/random’ labelled dataset consists of 2,470,736 patent pairs, of which 41,762 have a citation, 2,427,000 a random, and 1,974 a duplicate relation.

Obtaining relevancy labels from a patent attorney

As a subsample of the first dataset, our second dataset was constructed by taking ten of the target patents published in 2015, as well as their respective cited patents. In addition to that, in order to assess if relevant patents were missing from the search report, some of the random patents were included as well. These were selected based on their cosine similarity to the target patent, computed using the BOW vector representations. We chose for each patent the ten highest-scored, ten very low-ranked, and ten mid-ranked random patents. In total, this dataset subsample consists of 450 patent pairs, of which 151 are citations and 299 random pairs.

Neither knowing the similarity score of the patent pairs nor which ones were cited or random patents, the patent attorney manually assigned a score between 0 and 5 to the patent pairs according to how relevant the respective document was considered for the target patent, thus yielding the second labelled dataset. For most of the following evaluation, the patent attorney’s scoring was transformed into a binary labelling by considering all patent pairs with a score greater than $2$ as relevant and the others as irrelevant.
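The transformation into binary labels amounts to a simple threshold on the attorney's score. A minimal sketch, with hypothetical example scores:

```python
def binarize(score, threshold=2):
    """Scores strictly greater than the threshold count as relevant (1)."""
    return 1 if score > threshold else 0

# Hypothetical scores on the 0 to 5 scale, one per patent pair.
attorney_scores = [0, 1, 2, 3, 4, 5]
relevance_labels = [binarize(s) for s in attorney_scores]
```

Scores of 3, 4 and 5 are mapped to relevant, while scores of 0, 1 and 2 are mapped to irrelevant.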

Appendix C Supporting Information: Evaluation

Computing AUC scores to evaluate similarity measures

Computing similarity scores for all patent pairs results in two distributions of similarity scores, one for the positive samples (pairs of patents where one patent was cited by the other) and one for the negative samples (random patents). Ideally, these two distributions of scores would be separated, such that it is easy to choose a threshold to identify a positive or negative sample based on the corresponding similarity score of the patent pair (Fig 8). To measure how well these two distributions are separated, we can compute the area under the receiver operating characteristic (ROC) curve. Every possible threshold value chosen for separating positive from negative examples can lead to some pairs of unrelated patents being mistakenly considered as relevant, which are called false positives (FP), or to pairs of related patents being mistakenly regarded as irrelevant, so-called false negatives (FN). Correct decisions are either true negatives (TN), i.e., a pair of random patents that was correctly considered as irrelevant, or true positives (TP), which are correctly detected cited patents. Based on this, for every threshold value we can compute the true positive rate (TPR), also called recall, the false positive rate (FPR), and the false negative rate (FNR) to set wrong and correct decisions into relation:

$$\text{TPR}=\frac{TP}{TP+FN},\qquad\text{FPR}=\frac{FP}{FP+TN},\qquad\text{FNR}=\frac{FN}{FN+TP}$$

By plotting the TPR against the FPR for different similarity score thresholds, we then obtain the graph of the ROC curve, where the area under the ROC curve (AUC) conveniently translates the performance of the similarity measure into a number between $0.5$ (no separation between distributions) and $1$ (clear distinction between positive and negative samples), as shown in Fig 8. (Many information retrieval applications use precision and recall to measure the system’s performance by comparing the number of relevant documents to the number of retrieved documents. However, since we do not only want to retrieve relevant documents, but in general select a discriminatory, interpretable, and meaningful similarity score, we consider the AUC, which relates the system’s recall to its FPR.)
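To make the procedure concrete, the following sketch sweeps over all thresholds, collects the (FPR, TPR) points of the ROC curve and integrates them with the trapezoidal rule. The function names and the toy scores are assumptions for this example; in practice an existing library routine would typically be used instead.

```python
def roc_points(scores, labels):
    """Return the (FPR, TPR) points obtained by lowering the threshold
    step by step; pairs with identical scores are handled together."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points

def auc(points):
    """Area under the ROC curve via the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Hypothetical similarity scores; label 1 = cited, 0 = random.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]
auc_value = auc(roc_points(scores, labels))
```

In this toy example one random pair (score 0.7) is ranked above one cited pair (score 0.4), so eight of the nine cited/random comparisons are ordered correctly and the AUC is $8/9$.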

Appendix D Supporting Information: Results

D.1 Identifying cited patents using different similarity functions with BOW features

We evaluated all similarity measures listed in Table 5 using BOW features on the cited/random corpus. When computing the BOW features, we either used the term frequency (tf) or a binary flag ( $0/1$ ) for each word occurring in a document and experimented with raw values as well as values weighted by the words’ idf scores. Furthermore, these feature vectors were either normalized by the vector’s maximum value or its length. The AUC scores for all these combinations can be found in Table 7.
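The feature variants described above can be sketched with one small function. The whitespace tokenization and the idf formula $\log(N/\mathrm{df})$ are assumptions made for this example and may differ from the preprocessing actually used. Following the text, "length" denotes the Euclidean norm of the vector.

```python
import math
from collections import Counter

def bow_features(docs, binary=False, use_idf=False, norm="length"):
    """Build sparse BOW vectors (dicts) with tf or 0/1 entries, optional
    idf weighting, and normalization by maximum value or by length."""
    counts = [Counter(doc.lower().split()) for doc in docs]
    # Number of documents in which each word occurs.
    doc_freq = Counter(word for c in counts for word in c)
    n_docs = len(docs)
    vectors = []
    for c in counts:
        vec = {}
        for word, tf in c.items():
            value = 1.0 if binary else float(tf)
            if use_idf:
                # Assumed idf variant: log(N / df).
                value *= math.log(n_docs / doc_freq[word])
            vec[word] = value
        if norm == "max":
            scale = max(vec.values(), default=0.0)
        else:
            scale = math.sqrt(sum(v * v for v in vec.values()))
        if scale > 0:
            vec = {w: v / scale for w, v in vec.items()}
        vectors.append(vec)
    return vectors

# Hypothetical toy documents.
docs = ["a a b", "a c"]
tf_max = bow_features(docs, norm="max")
tf_length = bow_features(docs, norm="length")
tfidf_length = bow_features(docs, use_idf=True, norm="length")
```

In the toy example, the word "a" occurs in both documents and therefore receives an idf weight of zero, which illustrates how idf weighting suppresses words that carry little information about an individual document.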

For all similarity functions (excluding the Minkowski distance) the best result is obtained when using either tf (distance functions) or tf-idf (kernel functions, similarity coefficients, as well as Canberra and Euclidean distance) feature vectors. This shows that it is important to consider how often each term occurs in the documents instead of only encoding its presence or absence. Another observation that can be made is that the majority of the highest AUC scores is obtained on the tf-idf feature vectors, which give a more accurate insight on how important each term actually is for the given document and reduce the importance of stop words. Except for the Chebychev distance, the final normalization of the vectors should be performed using their lengths and not their maximum values. This might be due to the fact that the length normalization takes all the vector entries into account and not only the highest one, which makes it less sensitive to outliers, i.e. extremely high values in the vector. With length normalized vectors as input, the linear kernel is equal to the cosine similarity and can thus be included into the group of similarity coefficients.

All in all, except for the Euclidean distance, which gives the same AUC as the cosine similarity using normalized vectors, the kernel functions and similarity coefficients yield much better results than the distance measures, which shows that it is more important to focus on words the texts have in common instead of calculating their distance in the vector space. Among similarity coefficients and kernel functions, the former function class gives slightly more robust results. Given that similarity coefficients are especially designed for sequence comparison by explicitly taking into account their subsequences’ overlap, they seem to be the appropriate function class for measuring similarity between the BOW feature vectors.
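The agreement between the Euclidean distance and the cosine similarity on length normalized vectors follows from the identity $\lVert x-y\rVert^{2}=2-2\langle x,y\rangle$ for unit vectors: the distance is a monotonically decreasing function of the cosine similarity, so both rank the patent pairs identically and yield the same AUC. The following sketch checks this identity numerically on hypothetical vectors.

```python
import math

def dot(x, y):
    """Linear kernel (inner product) of two dense vectors."""
    return sum(a * b for a, b in zip(x, y))

def normalize(x):
    """Scale a vector to unit Euclidean length."""
    length = math.sqrt(dot(x, x))
    return [a / length for a in x]

def cosine(x, y):
    return dot(x, y) / (math.sqrt(dot(x, x)) * math.sqrt(dot(y, y)))

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Hypothetical count vectors.
x, y = [2.0, 1.0, 0.0], [1.0, 1.0, 1.0]
x_unit, y_unit = normalize(x), normalize(y)

linear_kernel_on_units = dot(x_unit, y_unit)
squared_distance_on_units = euclidean(x_unit, y_unit) ** 2
```

The linear kernel on the normalized vectors equals the cosine similarity of the raw vectors, and the squared Euclidean distance equals two minus twice that value.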

The cosine similarity is widely used in information retrieval [14, 23, 9] and is well suited to distinguish between cited and random patents as it assigns lower scores to random than to cited patent pairs and, additionally, reliably detects duplicates by assigning them a score near or equal to $1$ (Fig 2).

D.2 Detailed examination of outliers in the citation process

For a better understanding of the disagreements between the cited/random labelling and the cosine similarity scores compared to the relevant/irrelevant labelling, we take a closer look at a FP yielded by the cosine similarity as well as a FP yielded by both the cosine similarity and the cited/random labelling. In addition to that, in the main text we gave an example of a FN, i.e. a relevant patent that was missed by the patent examiner, but would have been found by our automated approach, as it received a high similarity score.

False positive yielded by our automated approach

The patent with ID US7585299 (http://www.google.de/patents/US7585299), marked with a gray circle in Fig 9 on the left, would correspond to a FP taking both human labellings as the ground truth, because it received a high cosine similarity score although being neither relevant nor a citation.

The target patent (ID US20150066086, http://www.google.de/patents/US20150066086) as well as the patent with ID US7585299 describe inventions that stabilize vertebrae. In the target patent, the described device clamps adjacent spinous processes together by two plates held together by two screws without introducing screws inside the bones. The device described in patent US7585299, in contrast, stabilizes the spine using bone anchors, which are screwed e.g. into the spinous processes or another part of the respective vertebrae and which have a clamp on the opposite end. The vocabulary in both patents is thus extremely similar, which leads to a high overlap on the BOW vector level. However, the two devices are far too different to be considered similar inventions, given that one is rigid and screwed into the bones whereas the other one only clamps the spinous processes and thereby guarantees a certain degree of flexibility.

False positive yielded by our automated approach and the cited/random labelling

For other target patents, more discordance with respect to the relevance of the other patents can be observed, also between the two human ratings. The correlation of the relevant/irrelevant scoring for the patent with ID US20150066087 (http://www.google.de/patents/US20150066087) in Fig 9 on the right shows that there are many cited patents that received a rather low score from the patent attorney, which means that the patent examiner produced a considerable number of FP. One possible explanation for this might be that patent examiners tend to produce more rather than fewer citations and thus include in the search report a large share of the patents returned for their keyword query, although, on closer inspection, the relevance for the target patent is unfounded. This is also due to the fact that they mostly base their search on the claims section, which is usually kept as general as possible to guarantee a maximum degree of protection for the invention. The analysis of the FP with ID US20130079880 (http://www.google.de/patents/US20130079880, marked by the gray circle in the plot) underpins this hypothesis. The claims sections of the two patents are similar and the devices described in the patents are of similar construction, both having plates referred to as wings. The device described in the target patent, however, is designed to immobilize adjacent spinous processes, whereas the one described in patent US20130079880 is aimed at increasing the space between two adjacent vertebrae to relieve pressure caused for instance by dislocated discs. In particular, the similar claims sections might have led the patent examiner to cite the patent, although the devices clearly have different purposes, which can easily be derived from their descriptions.

Bibliography


[1] usp [2018] Information Disclosure Statements, chapter 609. United States Patent and Trademark Office, 2018.
[2] Achakulvisut et al. [2016] Titipat Achakulvisut, Daniel E Acuna, Tulakan Ruangrong, and Konrad Kording. Science concierge: A fast content-based recommendation system for scientific publications. PLOS ONE, 11(7):e0158423, 2016.
[3] Achananuparp et al. [2008] Palakorn Achananuparp, Xiaohua Hu, and Xiajiong Shen. The evaluation of sentence similarity measures. In Proceedings of the 10th International Conference on Data Warehousing and Knowledge Discovery, DaWaK ’08, pages 305–316, Berlin, Heidelberg, 2008. Springer-Verlag.
[4] Alberts et al. [2011] Doreen Alberts, Cynthia Barcelon Yang, Denise Fobare-De Ponio, Ken Koubek, Suzanne Robins, Matthew Rodgers, Edlyn Simmons, and Dominic De Marco. Introduction to patent searching, pages 3–43. Springer, 2011.
[5] Alberts et al. [2017] Doreen Alberts, Cynthia Barcelon Yang, Denise Fobare-De Ponio, Ken Koubek, Suzanne Robins, Matthew Rodgers, Edlyn Simmons, and Dominic De Marco. Introduction to Patent Searching, chapter 1, pages 3–45. Springer Berlin Heidelberg, Berlin, Heidelberg, 2017. ISBN 978-3-662-53817-3.
[6] Andersson et al. [2017] Linda Andersson, Allan Hanbury, and Andreas Rauber. The Portability of Three Types of Text Mining Techniques into the Patent Text Genre, chapter 9, pages 241–280. Springer Berlin Heidelberg, Berlin, Heidelberg, 2017. ISBN 978-3-662-53817-3.
[7] Arras et al. [2016] Leila Arras, Franziska Horn, Grégoire Montavon, Klaus-Robert Müller, and Wojciech Samek. Explaining Predictions of Non-Linear Classifiers in NLP. In Proceedings of the 1st Workshop on Representation Learning for NLP, pages 1–7, Berlin, Germany, August 2016. Association for Computational Linguistics.
[8] Arras et al. [2017] Leila Arras, Franziska Horn, Grégoire Montavon, Klaus-Robert Müller, and Wojciech Samek. "What is relevant in a text document?": An interpretable machine learning approach. PLoS ONE, 12(8):e0181142, 2017.