The Last Fingerprint: How Markdown Training Shapes LLM Prose

E. M. Freeburg

arXiv:2603.27006 · cs.CL · April 1, 2026

TL;DR

This paper investigates how Markdown training influences large language models' use of em dashes, revealing that em dash frequency is a signature of training data and fine-tuning procedures.

Contribution

It proposes a mechanistic link between Markdown training, structural internalization, and em dash usage, and offers em dash frequency as a diagnostic for fine-tuning effects.

Findings

01. Em dash frequency varies widely across models, from 0.0 to 9.1 per 1,000 words.

02. Models instructed to avoid Markdown still produce em dashes, indicating a latent tendency.

03. Meta's Llama models do not produce em dashes, showing differences in fine-tuning.
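
The per-1,000-words rate quoted in the findings can be illustrated with a minimal sketch. This is not the paper's measurement code: the function name, the whitespace-based word count, and the choice to count only the U+2014 character (and not ASCII stand-ins such as a double hyphen) are all assumptions made for illustration.

```python
# Hypothetical helper, not taken from the paper: one plausible way to
# compute an "em dashes per 1,000 words" rate for a piece of text.

EM_DASH = "\u2014"  # assumption: only the Unicode em dash character is counted


def em_dashes_per_1000_words(text: str) -> float:
    """Return the number of em dashes per 1,000 whitespace-delimited words."""
    # Replace em dashes with spaces so that words joined by a dash
    # ("word\u2014word") are counted as two words, not one.
    words = text.replace(EM_DASH, " ").split()
    if not words:
        return 0.0  # avoid division by zero on empty input
    return 1000.0 * text.count(EM_DASH) / len(words)
```

Under these assumptions, a 1,000-word response containing nine em dashes scores 9.0, which is close to the top of the range reported above. A different tokenization or dash definition would shift the numbers, so rates are only comparable when computed the same way across models.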

Abstract

Large language models produce em dashes at varying rates, and the observation that some models "overuse" them has become one of the most widely discussed markers of AI-generated text. Yet no mechanistic account of this pattern exists, and the parallel observation that LLMs default to markdown-formatted output has never been connected to it. We propose that the em dash is markdown leaking into prose -- the smallest surviving unit of the structural orientation that LLMs acquire from markdown-saturated training corpora. We present a five-step genealogy connecting training data composition, structural internalization, the dual-register status of the em dash, and post-training amplification. We test this with a two-condition suppression experiment across twelve models from five providers (Anthropic, OpenAI, Meta, Google, DeepSeek): when models are instructed to avoid markdown formatting,…
