Large Language Models for Patent Classification: Strengths, Trade-offs, and the Long Tail Effect
Lorenzo Emer, Marco Lippi, Andrea Mina, Andrea Vandin

TL;DR
This study compares encoder-based models and large language models for patent classification, revealing their complementary strengths and trade-offs, especially in handling rare categories and balancing efficiency with coverage.
Contribution
It provides a systematic evaluation of LLMs versus encoder models on imbalanced patent data, highlighting the potential of hybrid approaches for improved classification.
Findings
Encoder models excel on frequent subclasses.
LLMs perform better on infrequent subclasses.
Encoder models are significantly more energy-efficient.
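The frequent-versus-infrequent contrast in these findings can be checked by scoring each subclass separately and averaging over two frequency buckets. The following is a minimal sketch of that idea, not the authors' evaluation code: the function names, the use of per-label F1, and the `threshold` cut-off between frequent and rare subclasses are all assumptions made for illustration.

```python
from collections import Counter


def per_label_f1(y_true, y_pred):
    """Return {label: F1} for multi-label predictions given as lists of sets."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for true, pred in zip(y_true, y_pred):
        for label in pred & true:
            tp[label] += 1
        for label in pred - true:
            fp[label] += 1
        for label in true - pred:
            fn[label] += 1
    scores = {}
    for label in set(tp) | set(fp) | set(fn):
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def head_tail_f1(train_labels, y_true, y_pred, threshold):
    """Average per-label F1 separately over frequent and rare labels.

    `threshold` is a hypothetical cut-off: labels seen at least this many
    times in the training data count as frequent, the rest as rare.
    """
    freq = Counter(label for labels in train_labels for label in labels)
    scores = per_label_f1(y_true, y_pred)
    head = [s for label, s in scores.items() if freq[label] >= threshold]
    tail = [s for label, s in scores.items() if freq[label] < threshold]

    def mean(xs):
        return sum(xs) / len(xs) if xs else float("nan")

    return mean(head), mean(tail)
```

Running `head_tail_f1` once on an encoder's predictions and once on an LLM's predictions, with the same `threshold`, would give the two pairs of numbers needed to compare the models on each bucket.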
Abstract
Patent classification into CPC codes underpins large-scale analyses of technological change but remains challenging due to its hierarchical, multi-label, and highly imbalanced structure. While pre-Generative AI supervised encoder-based models became the de facto standard for large-scale patent classification, recent advances in large language models (LLMs) raise questions about whether they can provide complementary capabilities, particularly for rare or weakly represented technological categories. In this work, we perform a systematic comparison of encoder-based classifiers (BERT, SciBERT, and PatentSBERTa) and open-weight LLMs on a highly imbalanced benchmark dataset (USPTO-70k). We evaluate LLMs under zero-shot, few-shot, and retrieval-augmented prompting, and further assess parameter-efficient fine-tuning of the best-performing model. Our results show that encoder-based models…
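To make the three prompting regimes concrete, the sketch below builds a prompt that is zero-shot when `k=0` and retrieval-augmented few-shot when `k>0`. It is an illustration under stated assumptions, not the paper's pipeline: the token-overlap (Jaccard) retriever stands in for whatever retrieval method the authors used, the prompt wording is hypothetical, and the example abstracts in the usage note are invented toy data.

```python
import re


def tokens(text):
    """Lower-cased set of alphanumeric tokens in `text`."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity between two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def retrieve(query, pool, k):
    """Return the k examples from `pool` most similar to `query`.

    `pool` is a list of (abstract, labels) pairs. Jaccard overlap is an
    assumed stand-in for a real retriever (e.g. embedding similarity).
    """
    q = tokens(query)
    ranked = sorted(pool, key=lambda ex: jaccard(q, tokens(ex[0])), reverse=True)
    return ranked[:k]


def build_prompt(query, pool, k=2):
    """Build a classification prompt with k retrieved demonstrations."""
    # Hypothetical instruction wording, chosen only for illustration.
    lines = ["Assign CPC subclasses to the patent abstract.", ""]
    for abstract, labels in retrieve(query, pool, k):
        lines += [f"Abstract: {abstract}", f"CPC subclasses: {', '.join(labels)}", ""]
    lines += [f"Abstract: {query}", "CPC subclasses:"]
    return "\n".join(lines)
```

A possible usage, with toy abstracts:

```python
pool = [
    ("A lithium battery electrode with improved capacity", ["H01M"]),
    ("A pharmaceutical composition for treating inflammation", ["A61K"]),
    ("A method for scheduling tasks on a processor", ["G06F"]),
]
print(build_prompt("An electrode material for a lithium ion battery", pool, k=1))
```

Choosing demonstrations by similarity to the query, rather than at random, is what distinguishes the retrieval-augmented setting from plain few-shot prompting in this sketch.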
Taxonomy
Topics: Intellectual Property and Patents · Big Data and Digital Economy · Machine Learning in Materials Science
