Understanding Subword Compositionality of Large Language Models

Qiwei Peng; Yekun Chai; Anders Søgaard

arXiv:2508.17953·cs.CL·August 26, 2025

TL;DR

This paper investigates how large language models compose subword units into word representations. It identifies three distinct compositional strategies and traces how they evolve across model layers, offering insight into the models' internal dynamics.

Contribution

It offers a comprehensive experimental analysis of subword compositionality in LLMs, identifying three distinct compositional patterns and their implications.

Findings

01 Three patterns in structural similarity evolution across layers

02 High sensitivity to semantic decompositionality at different layers

03 Distinct patterns in sensitivity to formal features like character length

Abstract

Large language models (LLMs) take sequences of subwords as input, requiring them to effectively compose subword representations into meaningful word-level representations. In this paper, we present a comprehensive set of experiments to probe how LLMs compose subword information, focusing on three key aspects: structural similarity, semantic decomposability, and form retention. Our analysis of the experiments suggests that these five LLM families can be classified into three distinct groups, likely reflecting differences in their underlying composition strategies. Specifically, we observe (i) three distinct patterns in the evolution of structural similarity between subword compositions and whole-word representations across layers; (ii) great performance when probing layer by layer their sensitivity to semantic decompositionality; and (iii) three distinct patterns when probing sensitivity to…
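As a rough illustration of what comparing a subword composition with a whole-word representation layer by layer could look like, the sketch below sums toy subword vectors and measures cosine similarity against a toy whole-word vector at each layer. Everything here is an assumption made for illustration: the abstract does not state the composition function or similarity measure used, and the vectors, the `compose` and `cosine` helpers, and the `layers` structure are hypothetical rather than taken from the paper.

```python
import math

def compose(subword_vectors):
    """Combine subword vectors by element-wise addition.

    Addition is an assumed composition function, chosen only for illustration.
    """
    return [sum(dims) for dims in zip(*subword_vectors)]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Hypothetical toy data: for each layer, the vectors of a word's subwords
# and the vector of the same word treated as a single unit.
layers = {
    0: {"subwords": [[1.0, 0.0], [0.0, 1.0]], "word": [1.0, 1.0]},
    1: {"subwords": [[1.0, 0.0], [0.0, 1.0]], "word": [1.0, 0.0]},
}

# Similarity between the composed subwords and the whole word, per layer.
similarity_by_layer = {
    layer: cosine(compose(vecs["subwords"]), vecs["word"])
    for layer, vecs in layers.items()
}

for layer, sim in sorted(similarity_by_layer.items()):
    print(f"layer {layer}: similarity {sim:.3f}")
```

With real models, the toy vectors would be replaced by hidden states extracted at each layer; how those are obtained and pooled is left open here because the abstract does not specify it.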
