Emergent Semantics Beyond Token Embeddings: Transformer LMs with Frozen Visual Unicode Representations

A. Bochkov

arXiv:2507.04886 · cs.CL · October 16, 2025

TL;DR

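This paper shows that Transformer language models converge and generate coherent text when their input embeddings are frozen vectors derived from the visual structure of Unicode glyphs rather than trainable semantic embeddings, which suggests that semantics emerge inside the model rather than in the embedding layer.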

Contribution

It introduces a method for training Transformer models on frozen, precomputed embeddings derived from the visual structure of Unicode glyphs, challenging the view that trainable input embeddings serve as foundational "meaning vectors."

Findings

01

Models with frozen Unicode visual embeddings outperform architecturally identical models with trainable embeddings on the MMLU reasoning benchmark.

02

High-level semantics emerge from the model architecture and data, not from trainable input embeddings.

03

The approach is compatible with any tokenizer, including a new Unicode-centric tokenizer.
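
The page does not describe how the Unicode-centric tokenizer works, so the following is only a minimal sketch of one way a tokenizer can guarantee universal text coverage: multi-character tokens are matched greedily, and any unmatched character falls back to its own Unicode code point. The class name `CodePointTokenizer`, the id layout, and the sample vocabulary are all hypothetical and are not taken from the paper or its repository.

```python
UNICODE_LIMIT = 0x110000  # size of the Unicode code point space


class CodePointTokenizer:
    """Hypothetical tokenizer: multi-character tokens plus a code point fallback."""

    def __init__(self, multi_char_tokens):
        # Assumption: multi-character tokens get ids above the code point range,
        # so every single code point can keep its own value as its id.
        self.multi = {tok: UNICODE_LIMIT + i for i, tok in enumerate(multi_char_tokens)}
        self.by_id = {i: tok for tok, i in self.multi.items()}
        self.max_len = max((len(t) for t in multi_char_tokens), default=1)

    def encode(self, text):
        ids, pos = [], 0
        while pos < len(text):
            # Try the longest multi-character match first (length >= 2).
            for size in range(min(self.max_len, len(text) - pos), 1, -1):
                piece = text[pos:pos + size]
                if piece in self.multi:
                    ids.append(self.multi[piece])
                    pos += size
                    break
            else:
                # Fallback: any single character is encodable as its code point.
                ids.append(ord(text[pos]))
                pos += 1
        return ids

    def decode(self, ids):
        return "".join(self.by_id[i] if i >= UNICODE_LIMIT else chr(i) for i in ids)


tok = CodePointTokenizer(["the", "ing"])
ids = tok.encode("the 猫 sing")
roundtrip = tok.decode(ids)
```

Because the fallback covers every code point, `encode` never fails on unseen characters, and `decode` recovers the input exactly.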

Abstract

Understanding the locus of semantic representation in large language models (LLMs) is crucial for interpretability and architectural innovation. The dominant paradigm posits that trainable input embeddings serve as foundational "meaning vectors." This paper challenges that view. We construct Transformer models where the embedding layer is entirely frozen, with vectors derived not from data, but from the visual structure of Unicode glyphs. These non-semantic, precomputed visual embeddings are fixed throughout training. Our method is compatible with any tokenizer, including a novel Unicode-centric tokenizer we introduce to ensure universal text coverage. Despite the absence of trainable, semantically initialized embeddings, our models converge, generate coherent text, and, critically, outperform architecturally identical models with trainable embeddings on the MMLU reasoning benchmark. We…
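
The core idea in the abstract, an embedding table computed once from glyph shapes and never updated, can be illustrated with a small sketch. It is not the author's implementation: the 3x5 bitmaps in `GLYPHS` are hand-drawn stand-ins for rendered glyphs, averaging over a token's characters is an assumed pooling choice, and the names `glyph_vector`, `token_vector`, and `FrozenEmbedding` are hypothetical.

```python
# Hypothetical 3x5 bitmaps standing in for rasterized Unicode glyphs.
# A real pipeline would render glyphs from a font; these are hand-drawn.
GLYPHS = {
    "a": ["010", "101", "111", "101", "101"],
    "b": ["110", "101", "110", "101", "110"],
    "c": ["011", "100", "100", "100", "011"],
}


def glyph_vector(ch):
    """Flatten one glyph bitmap into a fixed-length vector of 0.0/1.0 pixels."""
    return tuple(float(px) for row in GLYPHS[ch] for px in row)


def token_vector(token):
    """Assumed pooling: average the glyph vectors of the token's characters."""
    vecs = [glyph_vector(ch) for ch in token]
    return tuple(sum(col) / len(vecs) for col in zip(*vecs))


class FrozenEmbedding:
    """Lookup table built once from glyph shapes; it exposes nothing to train."""

    def __init__(self, vocab):
        self.index = {tok: i for i, tok in enumerate(vocab)}
        self.table = tuple(token_vector(tok) for tok in vocab)  # immutable

    def __call__(self, tokens):
        return [self.table[self.index[t]] for t in tokens]

    def trainable_parameters(self):
        return []  # an optimizer would receive no embedding weights


embed = FrozenEmbedding(["a", "b", "c", "ab"])
vectors = embed(["a", "ab"])
```

In PyTorch, the same effect is usually obtained by loading the precomputed table with `torch.nn.Embedding.from_pretrained(weights, freeze=True)`, which keeps the embedding weights out of gradient updates while the rest of the Transformer trains normally.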

Code & Models

Repositories

AVBochkov/Embeddings (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning

Methods: Dropout · Label Smoothing · Byte Pair Encoding · Absolute Position Encodings · Layer Normalization · Dense Connections · Softmax · Transformer