Tracing the Genealogies of Ideas with Large Language Model Embeddings
Lucian Li

TL;DR
This paper introduces a new computational method using large language model embeddings to trace intellectual influence and idea evolution across extensive textual corpora, capturing semantic and structural similarities.
Contribution
The paper presents a novel ensemble approach combining semantic and structural embeddings to detect ideas and influence in large, diverse textual datasets, including 19th-century publications.
Findings
Effective detection of ideas across 400,000 texts
Capable of identifying Darwinian influence in texts
Robust to paraphrasing and structural variations
Abstract
In this paper, I present a novel method to detect intellectual influence across a large corpus. Taking advantage of the unique affordances of large language models in encoding semantic and structural meaning while remaining robust to paraphrasing, we can search for substantively similar ideas and hints of intellectual influence in a computationally efficient manner. Such a method allows us to operationalize different levels of confidence: we can allow for direct quotation, paraphrase, or speculative similarity while remaining open about the limitations of each threshold. I apply an ensemble method combining General Text Embeddings, a state-of-the-art sentence embedding method optimized to capture semantic content, and an Abstract Meaning Representation graph representation designed to capture structural similarities in argumentation style and the use of metaphor. I apply this method to…
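The ensemble idea described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not say how the two signals are combined or where the thresholds sit, so the equal weighting, the Jaccard overlap over graph triples (a simple stand-in for an AMR graph comparison), the threshold values, and all function names below are assumptions chosen for illustration only.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as sharing nothing
    return dot / (norm_u * norm_v)


def triple_jaccard(triples_a, triples_b):
    """Structural overlap between two graphs given as (source, relation, target) triples.

    Assumption: Jaccard overlap is used here only as a simple stand-in for
    whatever graph comparison the paper applies to AMR graphs.
    """
    a, b = set(triples_a), set(triples_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def ensemble_score(semantic, structural, weight=0.5):
    """Weighted mix of the two signals; the 0.5 default is hypothetical."""
    return weight * semantic + (1.0 - weight) * structural


def confidence_label(score, quotation=0.95, paraphrase=0.80, speculative=0.60):
    """Map a score to a confidence tier; all threshold values are hypothetical."""
    if score >= quotation:
        return "direct quotation"
    if score >= paraphrase:
        return "paraphrase"
    if score >= speculative:
        return "speculative similarity"
    return "no match"
```

In use, each passage pair would supply one embedding vector per passage and one triple list per passage; `ensemble_score` then feeds `confidence_label`, so loosening or tightening a threshold makes the trade-off between recall and certainty explicit.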
Taxonomy
Topics: Computational and Text Analysis Methods
