Randomly Sampled Language Reasoning Problems Elucidate Limitations of In-Context Learning
Kavi Gupta, Kate Sanders, Armando Solar-Lezama

TL;DR
This paper probes the limits of in-context learning in large language models (LLMs) by testing them on simple, previously unseen language tasks. Even in these minimal-complexity settings, the LLMs underperform n-gram models.
Contribution
The paper introduces an experimental setup built on randomly sampled simple language tasks. Because the tasks are sampled at random, the setup separates in-context learning performance from the model's prior knowledge and exposes fundamental limitations of in-context learning.
Findings
LLMs underperform n-gram models on simple language tasks
In-context learning does not outperform basic statistical models in this setting
LLMs struggle with unseen language tasks regardless of task simplicity
Abstract
While LLMs have revolutionized the field of machine learning due to their high performance on a strikingly wide range of problems, they are also known to hallucinate false answers and underperform on less canonical versions of the same tasks. There are several emerging theories of LLM performance, among them that LLMs lack world modeling ability, that they have an undesirable bias towards an autoregressive prior, and that they struggle on more novel problems. The existing literature on LLM input novelty has focused on tasks of relatively high complexity, studying perturbations of canonical but complex problems. In this paper, we attempt to minimize complexity in order to isolate novelty as a factor in LLM underperformance and investigate the power of in-context learning. To this end, we consider an extremely simple domain: next token prediction on simple language tasks. The twist is…
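To make the kind of comparison described above concrete, here is a minimal sketch of an n-gram baseline doing next-token prediction on strings drawn from a randomly sampled automaton. It is an illustration only, not the authors' benchmark: the reviews below mention DFAs and regular languages, but the sampling scheme, the "legal continuation" scoring rule, and every name here (`sample_partial_dfa`, `sample_string`, `NGramPredictor`, `legal_accuracy`) are assumptions made for this example.

```python
import random
from collections import Counter, defaultdict


def sample_partial_dfa(num_states, alphabet, rng, edges_per_state=2):
    """Hypothetical sampler: each state allows a random subset of symbols,
    each leading to a randomly chosen target state."""
    dfa = {}
    for state in range(num_states):
        symbols = rng.sample(alphabet, edges_per_state)
        dfa[state] = {sym: rng.randrange(num_states) for sym in symbols}
    return dfa


def sample_string(dfa, length, rng, start=0):
    """Random walk from `start`, emitting one allowed symbol per step."""
    state, out = start, []
    for _ in range(length):
        sym = rng.choice(sorted(dfa[state]))
        out.append(sym)
        state = dfa[state][sym]
    return out


def legal_next_symbols(dfa, prefix, start=0):
    """Symbols the automaton allows after reading `prefix`."""
    state = start
    for sym in prefix:
        state = dfa[state][sym]
    return set(dfa[state])


class NGramPredictor:
    """Count-based n-gram model that backs off to shorter contexts."""

    def __init__(self, n):
        self.n = n
        self.counts = defaultdict(Counter)

    def fit(self, strings):
        for s in strings:
            for i in range(len(s)):
                # Record the symbol under every context of length 0..n-1.
                for k in range(min(self.n - 1, i) + 1):
                    self.counts[tuple(s[i - k:i])][s[i]] += 1
        return self

    def predict(self, prefix):
        # Try the longest seen context first, then back off.
        for k in range(min(self.n - 1, len(prefix)), -1, -1):
            context = tuple(prefix[len(prefix) - k:])
            if context in self.counts:
                return self.counts[context].most_common(1)[0][0]
        return None


def legal_accuracy(dfa, predictor, test_strings):
    """Fraction of predictions that are a legal continuation (assumed metric)."""
    hits = total = 0
    for s in test_strings:
        for i in range(1, len(s)):
            prefix = s[:i]
            hits += predictor.predict(prefix) in legal_next_symbols(dfa, prefix)
            total += 1
    return hits / total


rng = random.Random(0)  # fixed seed so the sketch is reproducible
alphabet = ["a", "b", "c"]
dfa = sample_partial_dfa(4, alphabet, rng)
train = [sample_string(dfa, 10, rng) for _ in range(30)]
test = [sample_string(dfa, 10, rng) for _ in range(5)]
model = NGramPredictor(3).fit(train)
print(f"legal-continuation accuracy: {legal_accuracy(dfa, model, test):.2f}")
```

In an experiment of this shape, the same training strings would be placed in an LLM's prompt as in-context examples, and its predictions would be scored with the same function as the n-gram model's.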
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
1. The paper introduces a novel benchmark that specifically targets and isolates the in-context learning ability of LLMs in a controlled setting involving synthetic or "alien" languages.
2. The authors provide a comprehensive comparison of multiple LLMs and statistical models, offering a broad overview of current model capabilities on the proposed task.
1. While the use of next-token prediction accuracy is straightforward, the paper could benefit from additional evaluation metrics, such as error type analysis or model calibration measures, to provide deeper insight into the specific failure modes of the models.
2. The analysis of why LLMs perform well on natural or regular languages but fail on randomly generated ones is limited. A more in-depth investigation into this contrast would strengthen the paper's impact.
The paper addresses an interesting question about the nature of in-context learning in LLMs. The setup is well-motivated, accompanied by a thorough related work section. The authors justify their benchmark design choices well and provide clear reasoning behind each task setup. The experimental evaluation also covers a wide range of pretrained LLMs, along with detailed reporting of implementation, prompting formats, and compute usage. The additional ablations, such as varying the number of in-co
(See questions)
- The authors study a clear hypothesis of whether current LLMs act as general-purpose in-context learners and highlight through a study with regular languages that they are not; thereby clearly validating their hypothesis. In fact, it is a clear study with a testable hypothesis and the authors succinctly provide a conclusion to the question that they ask.
- The evaluation conducted is quite thorough from the lens of the number of DFAs and examples used to evaluate the models as well as the vario
- While I commend the authors on the clear, well-studied and scoped-out problem formulation, I struggle to grasp the more general benefit behind such an analysis. What the authors show is a clear existence of a problem which can be templated within the language domain on which LLMs struggle. However, a discussion along the lines of *no free lunch* would have been quite helpful, in fact it is not unbelievable that such models would struggle on a lot of language-based tasks which are, in some sens
1. The presentation of the experiments and results is clear without any obvious/major issues.
2. The evaluations cover a wide variety of models, including open-weights, open-code, and proprietary models.
1. The novelty of the findings in this paper is rather limited and does not justify the claims made in Section 1 (contributions). In particular, the authors claim that they introduce an LLM ICL benchmark for novel tasks using the regular languages, but do not justify the motivation for such a new benchmark when compared to existing works, such as RegBench in [1]. Also, no experiment/result discusses the effects of RLHF on the ICL performance. I would suggest a rephrasing of the claims to avoid s
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Semantic Web and Ontologies
