How Accurate Are My AI Patent Search Results? The Importance of Measuring Precision and Recall

Artificial Intelligence is Upon Us

The patent information community is witnessing the development of powerful machine learning algorithms (AI-driven software) designed to revolutionize patent searching. The tools are new, but have improved the quality and comprehensiveness of prior art searches that are done by human analysts.

This is a welcome development as no prior art search is complete and most cite only the best available relevant art.

None is complete because prior art searching depends on the limited capacity of humans to find and sift through numerous records from a body of millions and to assess technological relevance under time constraints. This is easily demonstrated by giving two independent patent analysts an identical search request. Each will cite a different set of prior art references, with some overlap, nearly every time.

The best available references are cited because they reflect only the specific databases that were searched, in the specific language searched, and what the analyst believes will satisfy the request in the time given.

Algorithms that Learn

By contrast, software algorithms have the capacity to sift through millions of records, process information faster than the human brain, work continuously, and never become exhausted.

As trained patent search algorithms, these inexhaustible programs conduct millions of mathematical calculations to determine the relevancy of prior art. They compare the meaning of words and topics in a document and determine how closely a prior art reference within its dataset matches the invention disclosure or the patent upon which the search is based. Then, they are refined to provide more accurate results as they iterate over a known dataset.
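To make the idea of "how closely a reference matches" concrete, here is a minimal sketch that assumes a plain bag-of-words cosine similarity. This is an illustrative stand-in, not the method of any commercial system, which would typically compare meaning rather than raw word counts. The function names and the sample texts are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity between the word-count vectors of two texts."""
    counts_a = Counter(tokenize(text_a))
    counts_b = Counter(tokenize(text_b))
    # Dot product over the words the two texts share.
    dot = sum(counts_a[w] * counts_b[w] for w in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # An empty text matches nothing.
    return dot / (norm_a * norm_b)


# Hypothetical texts, invented for illustration only.
disclosure = "foldable rotor arm for a drone with battery latch"
reference_same = "foldable rotor arm for a drone with battery latch"
reference_other = "kettle with removable limescale filter"

print(cosine_similarity(disclosure, reference_same))   # identical text
print(cosine_similarity(disclosure, reference_other))  # unrelated text
```

Under this simple measure, identical texts score at (or within floating-point rounding of) the top of the scale, while unrelated texts score near the bottom.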

Let Us Give Them the Kitchen Sink…

Although a machine learning algorithm is faster, it is not truly intelligent like a brain and will generate non-relevant results more often than a human analyst. This may change as patent search algorithms improve, but it represents an obstacle to adoption.

Despite their promise, none of the commercially available AI-based patent search systems publish meaningful metrics of their performance. They instead cite too many patent references, most of which are not relevant, in the hope of capturing all relevant references. By doing so they supply large amounts of non-relevant prior art (a.k.a. noise) that a human analyst is compelled to review.

If a machine-learning algorithm cites 100 references and expects the patent analyst to read, review, and judge the relevancy of all 100 of them, the software is not an improvement over any of the search methods and software used currently.

Shouldn’t the system only cite relevant references and not cite non-relevant references?  If not, how is the AI system any better than traditional search methods?

Then Confuse Them With Relative Performance…

After “casting a wide net” (citing too much noise) and not separating relevant from non-relevant references, some AI patent search software providers then tag the results with inadequate measures of relevancy. For example, they may indicate relevancy on a continuum of 0.0 to 1.0 from lowest to highest level of relevance.

If the highest rated cited reference is .75 does that mean it is relevant? What about a rating of .63 or .50 or .47? No guidance is given. The user will need to study the referenced patent to decide for themselves.

Often, as Ensemble IP tests a system, we will be given a score of say .75 for a cited reference whose text is identical to the input text when it should be 1.0 (because it is the exact same document). Why wouldn’t the rating system produce a 1.0 when it finds an exact match? Is such a rating system a work-in-progress, or is it just the wrong approach? We think the latter.

In fact, such metrics are merely a proxy for relative relevancy. They tell you nothing about the individual cited art except that the higher the rating, the better it compares to lower-scored documents in the reference set. You still need to read the patent and maybe even every patent cited by the software. You are required to determine the cutoff point yourself, as the software is unhelpful.
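The cutoff problem can be illustrated with a short sketch. It assumes a human reviewer has already labeled each scored reference as relevant or not. The scores echo the ones discussed above, but the labels and the function name are hypothetical. Recall here is measured only against the labeled references, not against the whole patent corpus.

```python
# Hypothetical (score, is_relevant) pairs; the labels come from a human reviewer.
scored_refs = [
    (0.75, True), (0.63, True), (0.50, False), (0.47, True), (0.30, False),
]


def precision_recall_at(cutoff: float, refs: list[tuple[float, bool]]) -> tuple[float, float]:
    """Precision and recall if only references scoring >= cutoff are cited."""
    cited = [relevant for score, relevant in refs if score >= cutoff]
    true_positives = sum(cited)
    total_relevant = sum(relevant for _, relevant in refs)
    precision = true_positives / len(cited) if cited else 0.0
    recall = true_positives / total_relevant if total_relevant else 0.0
    return precision, recall


for cutoff in (0.70, 0.60, 0.45):
    precision, recall = precision_recall_at(cutoff, scored_refs)
    print(f"cutoff {cutoff:.2f}: precision {precision:.2f}, recall {recall:.2f}")
```

The same scores produce very different result sets depending on where the line is drawn, which is why a relative score without a stated cutoff leaves the real decision to the user.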

…And Say Nothing About Missing Relevant Art

Shouldn’t performance measures predict the probability that relevant art was not even retrieved by the AI software system? Admittedly, this is a difficult problem because two or more capable analysts can disagree on relevancy, not all of the world’s data are available for machine learning, and the same search project is not usually repeated.

So any predictive metric will be inexact. Ratings might need to be stated as a range of probabilities or qualified for certain search types or technologies. In our expert view, an AI patent search system should alert the user that some relevant art may not have been uncovered with machine learning.
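One way a rating could be stated as a range is sketched below. It assumes recall is estimated from an audited sample of searches and reported with a Wilson score interval. That statistical choice is ours for illustration, the audit figures are hypothetical, and nothing here describes how any existing system reports its results.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion; z = 1.96 gives roughly 95% coverage."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half_width = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half_width, center + half_width


# Hypothetical audit: reviewers know of 60 relevant patents; the system cited 20 of them.
low, high = wilson_interval(20, 60)
print(f"estimated recall {20 / 60:.2f}, plausible range {low:.2f} to {high:.2f}")
```

A range like this tells the user both the point estimate and how much the small sample limits confidence in it.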

As users of AI search systems, we do not believe that providers intend harm by “giving us the kitchen sink, confusing us with relative performance, and saying nothing about missing relevant art.” Indeed we believe it is very difficult to provide meaningful performance measures in a subjective field with few repeated search requests, and the AI software providers would likely agree.

Better Measures of Performance

Therefore, we propose the use of two measures that are well-understood and practiced in the data science and scientific communities. They are measures of precision and recall. They can be understood by creating and inspecting a table known as a Confusion Matrix with a sample of search results.

The confusion matrix below categorizes a search result in which 20 patents were predicted relevant and actually were relevant. Ten (10) patents were also cited as relevant but were not relevant, and forty (40) patents were not cited (that is, predicted not relevant) but were relevant. The remaining patents in the patent corpus (thousands of them) were predicted as not relevant and were, in fact, not relevant.

| Total Population | Actual Relevance: Relevant (Deemed relevant) | Actual Relevance: Not relevant (Deemed not relevant) |
| --- | --- | --- |
| Predicted Relevance: Relevant | True Positives. Predicted relevant and it is relevant according to most experts, the patent office, or the courts | False Positives. Predicted relevant, but it is not relevant according to most patent experts, the patent office, or the courts (Type 1 error) |
| Predicted Relevance: Not relevant | False Negatives. Predicted not relevant, but it is relevant according to most experts, the patent office, or the courts (Type 2 error) | True Negatives. Predicted not relevant and it is not relevant according to most experts, the patent office, or the courts. All the remaining patent documents in the database |
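The four cells can be counted mechanically once the cited set and the truly relevant set are known. The sketch below assumes documents are identified by string IDs and that the corpus holds 10,000 patents. Both the IDs and the corpus size are hypothetical, chosen only to reproduce the example counts of 20, 10, and 40.

```python
def confusion_counts(cited: set[str], relevant: set[str], corpus_size: int) -> dict[str, int]:
    """Count the four confusion-matrix cells from sets of document IDs."""
    true_positives = len(cited & relevant)    # cited and relevant
    false_positives = len(cited - relevant)   # cited but not relevant (noise)
    false_negatives = len(relevant - cited)   # relevant but missed
    true_negatives = corpus_size - true_positives - false_positives - false_negatives
    return {
        "TP": true_positives,
        "FP": false_positives,
        "FN": false_negatives,
        "TN": true_negatives,
    }


# Hypothetical document IDs arranged to match the example counts.
relevant = {f"R{i}" for i in range(60)}                               # 60 relevant patents
cited = {f"R{i}" for i in range(20)} | {f"N{i}" for i in range(10)}   # 20 hits plus 10 noise
counts = confusion_counts(cited, relevant, corpus_size=10_000)        # corpus size is assumed
print(counts)
```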

From this matrix, we have calculated precision and recall. Precision measures how many of the patents in your basket that you believe are relevant really are relevant. If the search were completely precise it would receive a score of 1.0. In other words, it would have NEVER cited non-relevant prior art references.

Recall measures how many of the relevant patents were identified in the first place. In other words, it looks at the gems in the basket compared to the gems you failed to put in the basket. If the search has complete recall it would receive a score of 1.0. In other words, it would have cited ALL relevant prior art references.

How to Calculate Precision and Recall


Precision is the percentage of patents that were retrieved that are relevant. For example, if an AI search system retrieves 30 patents and 20 of them are relevant then the precision is .67 (2 of 3 patents cited are truly relevant). The 20 patents are “true positives” (cited as relevant and truly relevant). The other 10 patents are “false positives” (predicted as relevant but not relevant).


Recall is the percentage of relevant patents that were actually retrieved. For example, if that same AI search system failed to cite 40 other patents that were relevant, it recalled only 20 of 60 relevant patents. The recall is 20/60 or .33. In this way, recall measures how complete the results are.
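The two calculations above can be written out as a short sketch, using the counts from the example (20 true positives, 10 false positives, 40 false negatives). The function names are ours.

```python
def precision(true_positives: int, false_positives: int) -> float:
    """Share of cited patents that are truly relevant."""
    cited = true_positives + false_positives
    return true_positives / cited if cited else 0.0


def recall(true_positives: int, false_negatives: int) -> float:
    """Share of truly relevant patents that were cited."""
    relevant = true_positives + false_negatives
    return true_positives / relevant if relevant else 0.0


p = precision(true_positives=20, false_positives=10)   # 20 of 30 cited
r = recall(true_positives=20, false_negatives=40)      # 20 of 60 relevant
print(f"precision = {p:.2f}, recall = {r:.2f}")  # prints "precision = 0.67, recall = 0.33"
```

Returning 0.0 when nothing was cited or nothing is relevant is a convention chosen here to avoid division by zero. Other conventions are possible.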


A perfect patent search would cite only relevant art (perfect precision = 100%) and all relevant art (perfect recall = 100%). Of course, perfection is nearly impossible but should be the objective of AI patent search software tools and patent analysts who practice the profession.

Our opinion is that lasting innovations in AI prior art searching would accelerate the retrieval of relevant prior art references, cite mostly relevant references, and cite a more complete set of relevant references. They would also provide credible, informative, and well-accepted measures to assess the performance of their algorithms.

The best place to start is with the measurements of precision and recall. These metrics will give software providers the correct objectives and will challenge their ability to apply machine learning for patent searching. It will be especially challenging to predict recall given the universe of unknown references; but without it, these tools will fail to achieve their promise.

From the confusion matrix, much analysis can be done to measure the accuracy of an algorithm. It is beyond the scope of this paper to discuss those types of measurements, but the confusion matrix is one of the most important diagrams in machine learning and predictive analytics.

Dave is the CEO and a founding shareholder at Ensemble IP. Prior to joining the firm, he owned and led Landon IP, Inc., a professional patent search, analytics, and training company with hundreds of employees at operating offices in the United States, Japan, United Kingdom, Germany, India, and China. Dave is a graduate of the College of William & Mary (B.A., ‘87, M.B.A, ‘93) and the Harvard Business Analytics Program (’19). He is currently pursuing a Master of Science in Data Science at Northwestern University. Dave speaks English and Japanese.
