LLM generation novelty through the lens of semantic similarity
Philipp Davydov, Ameya Prabhu, Matthias Bethge, Elisa Nguyen, Seong Joon Oh

TL;DR
This paper introduces a semantic-similarity-based framework for measuring the novelty of large language model generations against pretraining data, and uses it to examine how pretraining data reuse, task domain, and instruction tuning affect novelty.
Contribution
The paper frames novelty evaluation as a semantic retrieval problem, so that embedding and indexing pipelines can measure LLM generation novelty efficiently at pretraining scale.
Findings
Models draw on pretraining data across much longer sequences than previously reported.
Task domains influence the level of generation novelty.
Instruction tuning increases the novelty of generated outputs.
Abstract
Generation novelty is a key indicator of an LLM's ability to generalize, yet measuring it against full pretraining corpora is computationally challenging. Existing evaluations often rely on lexical overlap, failing to detect paraphrased text, or do not consider the full pretraining corpus. We frame novelty as a semantic retrieval problem. This framing enables us to address novelty with modern embedding and indexing pipelines, allowing for efficient analysis at pre-training scale. Specifically, we propose a three-stage framework that retrieves semantically similar samples, reranks them at varying subsequence lengths, and calibrates scores using a human novelty reference for interpretability. We apply this framework to the SmolLM model family and report three key findings: (1) models draw on pre-training data across much longer sequences than previously reported; (2) some task domains…
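The three stages named in the abstract (retrieve, rerank at varying subsequence lengths, calibrate against a human reference) can be illustrated with a toy sketch. This is not the authors' implementation: the bag-of-words `embed` function stands in for a neural embedding model, the brute-force search stands in for a real index, and every function name, the window lengths, and the percentile-style calibration rule are assumptions made here for illustration only. The human reference scores are made-up numbers.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for a neural embedder (assumption): bag-of-words counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(generation, corpus, k=2):
    """Stage 1: the k corpus documents most similar to the generation."""
    query = embed(generation)
    ranked = sorted(corpus, key=lambda doc: cosine(query, embed(doc)), reverse=True)
    return ranked[:k]


def windows(text, length):
    """All contiguous word windows of `length` (the whole text if shorter)."""
    words = text.split()
    if len(words) <= length:
        return [" ".join(words)]
    return [" ".join(words[i:i + length]) for i in range(len(words) - length + 1)]


def rerank(generation, candidates, lengths=(4, 8)):
    """Stage 2: best window-to-window similarity at each subsequence length."""
    scores = {}
    for n in lengths:
        scores[n] = max(
            (
                cosine(embed(g), embed(c))
                for g in windows(generation, n)
                for doc in candidates
                for c in windows(doc, n)
            ),
            default=0.0,
        )
    return scores


def calibrate(score, human_scores):
    """Stage 3 (hypothetical rule): share of human-reference scores <= score."""
    if not human_scores:
        return 0.0
    return sum(h <= score for h in human_scores) / len(human_scores)


corpus = [
    "the quick brown fox jumps over the lazy dog",
    "stock prices fell sharply on monday morning",
    "a recipe for tomato soup with fresh basil",
]
generation = "the quick brown fox jumps over a sleeping cat"

candidates = retrieve(generation, corpus, k=1)
per_length = rerank(generation, candidates, lengths=(4, 6))
human_reference = [0.2, 0.4, 0.6, 0.8]  # made-up similarity scores
percentile = calibrate(per_length[4], human_reference)
```

In this sketch a higher similarity means lower novelty: a calibrated value near 1 says the generation sits closer to the corpus than the (made-up) human-written reference texts do. Scoring several window lengths separately is what would let an analysis ask how long the reused spans are, which is the kind of question the first finding addresses.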
Taxonomy
Topics: Topic Modeling · Text Readability and Simplification · Natural Language Processing Techniques
