TL;DR
This paper examines the ability of language models to compose skills in-context for complex tasks, revealing challenges in recognition and assembly of skills, and proposing methods to improve this capability.
Contribution
It provides a systematic analysis of in-context skill composition in language models and introduces a method to enhance their performance based on aligned examples.
Findings
Simple task examples can negatively impact performance.
Models struggle to recognize and assemble skills correctly.
Aligning examples with composition steps improves performance.
Abstract
Composing basic skills from simple tasks to accomplish composite tasks is crucial for modern intelligent systems. We investigate the in-context composition ability of language models to perform composite tasks that combine basic skills demonstrated in in-context examples. This is more challenging than the standard setting, where skills and their composition can be learned in training. We conduct systematic experiments on various representative open-source language models, utilizing linguistic and logical tasks designed to probe composition abilities. The results reveal that simple task examples can have a surprising negative impact on the performance, because the models generally struggle to recognize and assemble the skills correctly, even with Chain-of-Thought examples. Theoretical analysis further shows that it is crucial to align examples with the corresponding steps in the…
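To make the setting concrete, here is a minimal sketch of how an in-context composition prompt might be assembled and how a model's output could be checked against each skill. It is illustrative only. The two skills (`to_upper`, `reverse_words`), the bracketed tags, the prompt layout, and the helper names are assumptions made for this example, not the tasks or format used in the paper.

```python
from typing import Callable, List

Skill = Callable[[str], str]


# Hypothetical basic skills; the paper's actual tasks are not reproduced here.
def to_upper(text: str) -> str:
    return text.upper()


def reverse_words(text: str) -> str:
    return " ".join(reversed(text.split()))


def compose(first: Skill, second: Skill) -> Skill:
    """Return the composite skill that applies `first`, then `second`."""
    return lambda text: second(first(text))


def format_examples(tag: str, skill: Skill, inputs: List[str]) -> List[str]:
    """Render input/output demonstrations for one skill under an assumed tag."""
    return [f"{tag} Input: {x}\n{tag} Output: {skill(x)}" for x in inputs]


def build_prompt(simple: List[str], composite: List[str],
                 query: str, composite_tag: str) -> str:
    """Concatenate simple-task and composite-task examples, then the query."""
    blocks = simple + composite
    blocks.append(f"{composite_tag} Input: {query}\n{composite_tag} Output:")
    return "\n\n".join(blocks)


def classify_output(output: str, query: str,
                    first: Skill, second: Skill) -> str:
    """Label which skill(s) an output corresponds to (exact match only)."""
    candidates = {
        "composite": second(first(query)),
        "first_only": first(query),
        "second_only": second(query),
    }
    for label, expected in candidates.items():
        if output.strip() == expected:
            return label
    return "other"


if __name__ == "__main__":
    simple = (format_examples("[A]", to_upper, ["one two"])
              + format_examples("[B]", reverse_words, ["three four"]))
    composite = format_examples("[A+B]", compose(to_upper, reverse_words),
                                ["five six"])
    print(build_prompt(simple, composite, "red green blue", "[A+B]"))
```

A label of `first_only` or `second_only` from `classify_output` would correspond to a model applying a single skill instead of assembling both, which is the failure mode the abstract describes.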
Peer Reviews
Decision·Submitted to ICLR 2026
An in-depth study of the in-context composition ability of LLMs is important for understanding their behavior in key functionalities such as long chain-of-thought reasoning and agent tasks.
1. The paper does not cite or compare with important related work, such as: [1] J. Chen et al. 2024. Skills-in-Context: Unlocking Compositionality in Large Language Models. In Findings of the Association for Computational Linguistics: EMNLP 2024. 2. The paper does not analyze the in-context composition behavior of the recently emerged long-CoT reasoning LLMs. It would be interesting to analyze whether the “thinking” process involves the composition of such elementary skills. 3. The paper c
* The paper provides multi-view analysis—performance trends, output correspondences, attention-similarity visualization, and ablation on operator vs. content cues—all supporting a consistent explanation. * The theoretical and empirical sections are well integrated.
* Missing Related Work: The paper omits citation and comparison with Skills-in-Context Prompting: Unlocking Compositionality in Large Language Models, which introduced a closely related framework for enabling compositional reasoning via “skills-in-context” demonstrations. That work similarly examined how combining basic skills in a single prompt affects generalization and proposed structured prompting strategies. The present submission should explicitly position itself relative to SKiC, clar
Clear negative result that challenges common intuition: more shots of sub-skills can degrade composition performance; trend is shown across many models/sizes and tasks. Careful empirical probing: shuffling to reduce order sensitivity, correspondence analysis showing models often perform just one sub-task, and operator-vs-content ablations clarifying what is attended to. Theoretical insight: formalizes when failing to distinguish data sources leads to lower bounds, and when step-aligned evidenc
Scope limited to synthetic tasks with T=2: It’s unclear whether the phenomenon and ExpCoT gains persist for (i) more realistic multi-hop QA/program induction, (ii) longer compositions (T>2), or (iii) noisy/ambiguous operators. Current tasks may overemphasize symbol/operator cues. (Datasets: 9 compositions from eight base skills.) Model coverage & claims: Results exclude strongest closed models; conclusions about “LLMs in general” may overreach. The paper notes the resource constraint, but the c
