Characterizing Pattern Matching and Its Limits on Compositional Task Structures

Hoyeon Chang; Jinho Park; Hanseul Cho; Sohee Yang; Miyoung Ko; Hyeonbin Hwang; Seungpil Won; Dohaeng Lee; Youbin Ahn; Minjoon Seo

arXiv:2505.20278 · cs.LG · March 3, 2026

TL;DR

This paper formalizes pattern matching as functional equivalence in language models, providing theoretical bounds and empirical evidence on how models generalize in compositional tasks and the limitations posed by path ambiguity.

Contribution

It introduces a formal framework for pattern matching as functional equivalence and offers theoretical bounds and empirical validation for model generalization in compositional tasks.

Findings

01. Success is predicted by the number of contexts witnessing functional equivalence.

02. A tight sample complexity bound for learning two-hop structures is established.

03. Path ambiguity impairs model accuracy and interpretability.

Abstract

Despite impressive capabilities, LLMs' successes often rely on pattern-matching behaviors, yet these are also linked to OOD generalization failures in compositional tasks. However, behavioral studies commonly employ task setups that allow multiple generalization sources (e.g., algebraic invariances, structural repetition), obscuring a precise and testable account of how well LLMs perform generalization through pattern matching and their limitations. To address this ambiguity, we first formalize pattern matching as functional equivalence, i.e., identifying pairs of subsequences of inputs that consistently lead to identical results when the rest of the input is held constant. Then, we systematically study how decoder-only Transformer and Mamba behave in controlled tasks with compositional structures that isolate this mechanism. Our formalism yields predictive and quantitative insights:…
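The definition of functional equivalence in the abstract can be made concrete with a small sketch. The code below is illustrative only and is not taken from the authors' repository: the names `witness_counts` and `toy_observations`, the toy lookup tables `F1` and `F2`, and the choice to treat the first two inputs of a two-hop task as the "fragment" and the third input as the "context" are all assumptions made for this example. It counts, for each pair of fragments, how many shared contexts witness identical outputs, and drops any pair that disagrees in at least one shared context.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical two-hop task: t = F2[(F1[(x1, x2)], x3)].
# These lookup tables are toy assumptions, not data from the paper.
F1 = {(0, 1): "b0", (2, 3): "b0", (4, 5): "b1"}
F2 = {("b0", "u"): "t0", ("b0", "v"): "t1",
      ("b1", "u"): "t0", ("b1", "v"): "t2"}


def toy_observations(contexts):
    """Return (fragment, context, output) triples for the given contexts."""
    return [(frag, ctx, F2[(F1[frag], ctx)])
            for frag in F1 for ctx in contexts]


def witness_counts(observations):
    """Count the shared contexts that witness equivalence of fragment pairs.

    A pair is reported only if the two fragments share at least one context
    and produce identical outputs in every context they share.
    """
    by_fragment = defaultdict(dict)
    for fragment, context, output in observations:
        by_fragment[fragment][context] = output

    counts = {}
    for a, b in combinations(sorted(by_fragment), 2):
        shared = by_fragment[a].keys() & by_fragment[b].keys()
        if not shared:
            continue  # no evidence either way
        if all(by_fragment[a][c] == by_fragment[b][c] for c in shared):
            counts[(a, b)] = len(shared)
    return counts


full = witness_counts(toy_observations(["u", "v"]))
partial = witness_counts(toy_observations(["u"]))
```

With both contexts observed, `full` contains only the pair `((0, 1), (2, 3))` with two witnesses, since those fragments map to the same intermediate value. With only context `"u"` observed, `partial` reports all three pairs with one witness each, including pairs that are not truly equivalent. This is one way to see why the number of witnessing contexts matters for whether an observed equivalence can be trusted.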

Peer Reviews

Decision: ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 3

Strengths

1. This paper studies an important problem of the compositional generalization of language models.
2. The experiments include various settings of practical relevance.

Weaknesses

1. The results are limited to small synthetic task structures.
2. The settings require a deterministic function and strict functional equivalence, which may be too restrictive in a real-world NLP dataset.

Reviewer 02 · Rating 8 · Confidence 4

Strengths

- At a high level, thinking about generalization in terms of many-to-one functions seems like it clearly captures a kind of task-level generalization. Completing the task correctly requires non-trivial logical reasoning. The task has the nice properties that (1) it is possible to get 100% accuracy when correctly applying logical reasoning / a graph algorithm and (2) the LLM never sees the exact problem instance it is evaluated on.
- The empirical results in the paper strongly support the narrati

Weaknesses

- I found the exposition introducing the problem to be a bit confusing. It wasn’t clear to me whether pattern matching is a desirable or undesirable property of transformers (is it capturing overfitting or generalizing?) The abstract suggests that surface-level pattern matching is bad, but perhaps that deeper pattern matching (which survives multiple logical steps) is a good thing.
- I am also confused about why it is interesting to understand pattern matching in LLMs. I’m not sure how the toy p

Reviewer 03 · Rating 6 · Confidence 3

Strengths

The domain setup seems to eliminate other potential sources of information cleanly. The definition of functional equivalence and specifically k-equivalence are simple and naturalistic definitions. The large sweep over a variety of dataset sizes is also helpful for determining the role of data access.

Weaknesses

The abstract and first paragraph of the introduction do not make it clear enough that “pattern matching” is undesirable. The first sentence could be read as “pattern matching” performed by LLMs as being too surface level. This reading recontextualizes later uses of the term to be neutral rather than negative, confusing such a reader. It should be made more clear that “pattern matching” specifically is being used to exclusively refer to undesirably syntactic/surface level heuristics. "Functional

Code & Models

Repositories

kaistai/coverage-principle · PyTorch · Official

Taxonomy

Topics: Economic, financial, and policy analysis · Italy: Economic History and Contemporary Issues · Economic Policies and Impacts

Methods: Sparse Evolutionary Training