The Last Fingerprint: How Markdown Training Shapes LLM Prose
E. M. Freeburg

TL;DR
This paper investigates how Markdown training influences large language models' use of em dashes, revealing that em dash frequency is a signature of training data and fine-tuning procedures.
Contribution
It establishes a mechanistic link between Markdown training, structural internalization, and em dash usage, providing a diagnostic tool for fine-tuning effects.
Findings
Em dash frequency varies widely across models, from 0.0 to 9.1 per 1,000 words.
Models instructed to avoid Markdown still produce em dashes, indicating a latent tendency.
Meta's Llama models produce no em dashes at all, which suggests that fine-tuning procedures differ across providers.
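The per-1,000-words rate quoted in the findings can be illustrated with a short Python sketch. This is only one plausible way to compute such a rate: the summary does not say how the paper tokenizes words or whether it also counts ASCII stand-ins such as `--`. The function name `em_dashes_per_1000_words` and the whitespace-based word count are assumptions made for illustration.

```python
# Hypothetical helper, not the paper's code: counts only the Unicode
# em dash (U+2014) and treats whitespace-separated tokens as words.
EM_DASH = "\u2014"


def em_dashes_per_1000_words(text: str) -> float:
    """Return the number of em dashes per 1,000 words in `text`."""
    dash_count = text.count(EM_DASH)
    # Replace em dashes with spaces so that words joined by a dash
    # are counted as two words rather than one.
    words = text.replace(EM_DASH, " ").split()
    if not words:
        return 0.0
    return 1000.0 * dash_count / len(words)


if __name__ == "__main__":
    sample = "The model paused" + EM_DASH + "then continued writing plain prose."
    print(round(em_dashes_per_1000_words(sample), 1))
```

Applied to a model's outputs under the default and the no-Markdown conditions, a function like this would give two rates that can be compared per model, which is the shape of the comparison the findings describe.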
Abstract
Large language models produce em dashes at varying rates, and the observation that some models "overuse" them has become one of the most widely discussed markers of AI-generated text. Yet no mechanistic account of this pattern exists, and the parallel observation that LLMs default to markdown-formatted output has never been connected to it. We propose that the em dash is markdown leaking into prose -- the smallest surviving unit of the structural orientation that LLMs acquire from markdown-saturated training corpora. We present a five-step genealogy connecting training data composition, structural internalization, the dual-register status of the em dash, and post-training amplification. We test this with a two-condition suppression experiment across twelve models from five providers (Anthropic, OpenAI, Meta, Google, DeepSeek): when models are instructed to avoid markdown formatting,…
