From 124 Million Tokens to 1,021 Neologisms: A Large-Scale Pipeline for Automatic Neologism Detection

Diego Rossini; Lonneke van der Plas

arXiv:2605.06426·cs.CL·May 8, 2026

TL;DR

This paper introduces a scalable, modular pipeline that combines rule-based filtering with LLM classification to detect neologisms in 527 million English-language Reddit posts. It yields 1,021 candidates, of which 599 were manually verified as genuine lexical innovations.

Contribution

The paper presents a large-scale, modular pipeline for automatic neologism detection that is grounded in two word-formation frameworks (grammatical and extra-grammatical morphology) and pairs rule-based filtering with LLM classification. The implementation is publicly available.

Findings

01

Extracted 124.6 million unique tokens from 527 million English-language Reddit posts (2005-2024) and reduced them to 1,021 neologism candidates.

02

Manual verification confirmed 599 of the 1,021 candidates (58.7%) as genuine neologisms.

03

The LLMs used as classifiers disagreed substantially with one another, highlighting the difficulty of large-scale neologism detection.
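
The headline figures can be checked with a few lines of arithmetic. This is only a sanity check on the numbers reported on this page; the token count is the rounded 124.6 million, so the reduction rate is approximate, and the variable names are illustrative.

```python
# Figures as reported on this page (token count is rounded).
unique_tokens = 124_600_000
candidates = 1_021
confirmed = 599

# Share of unique tokens removed before the candidate stage.
reduction = 1 - candidates / unique_tokens

# Share of candidates confirmed by manual verification.
confirmed_share = confirmed / candidates

print(f"reduction: {reduction:.4%}")              # 99.9992%
print(f"confirmed share: {confirmed_share:.1%}")  # 58.7%
```

Both values agree with the reported "over 99.99%" reduction and the 58.7% confirmation rate.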

Abstract

We present a scalable, modular pipeline for automatic neologism detection that combines rule-based filtering with LLM classification. The pipeline is grounded in two complementary word-formation frameworks, grammatical and extra-grammatical morphology, which jointly define the scope of what counts as a neologism and inform a four-class classification scheme (neologism, entity, foreign, none). While designed to be modular and transferable at the architectural level, the pipeline is instantiated on 527 million English-language Reddit posts spanning 2005-2024. From this corpus, we extract 124.6 million unique tokens and reduce them by over 99.99% to yield 1,021 neologism candidates, a set small enough for manual expert verification. Multiple LLMs independently classify each candidate via majority vote, with a final verification step, revealing substantial cross-model disagreement and…
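
The majority-vote step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `majority_vote`, the example votes, and the choice to flag ties for manual review are all assumptions, since the abstract does not say how ties are resolved. Only the four labels come from the paper.

```python
from collections import Counter

# The four classes named in the abstract.
LABELS = ("neologism", "entity", "foreign", "none")


def majority_vote(votes):
    """Combine per-model labels for one candidate word.

    Returns (label, unanimous). On a tie for first place the label is
    None, meaning "send to manual review" (an assumption, not something
    the abstract specifies).
    """
    if not votes:
        raise ValueError("at least one vote is required")
    for vote in votes:
        if vote not in LABELS:
            raise ValueError(f"unknown label: {vote!r}")

    ranked = Counter(votes).most_common()
    top_label, top_count = ranked[0]

    # A tie for first place has no majority winner.
    if len(ranked) > 1 and ranked[1][1] == top_count:
        return None, False

    return top_label, top_count == len(votes)


# Hypothetical votes from three models for one candidate.
print(majority_vote(["neologism", "neologism", "entity"]))
```

Tracking the `unanimous` flag alongside the winning label is one simple way to surface the kind of cross-model disagreement the abstract reports.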

Code & Models

Repositories

DiegoRossini/neologism-pipeline (GitHub)
