Learning Composable Chains-of-Thought

Fangcong Yin; Zeyu Leo Liu; Liu Leqi; Xi Ye; Greg Durrett

arXiv:2505.22635·cs.CL·May 29, 2025

Open Access · 4 Reviews

TL;DR

This paper explores methods for training large language models to generalize reasoning skills compositionally by modifying chain-of-thought formats and combining atomic task models, improving zero-shot performance on unseen compositional tasks.

Contribution

It introduces composable chain-of-thought formats for atomic tasks and combines the resulting atomic task models, so that reasoning skills generalize to compositional tasks outside the training data.

Findings

01 · Composable CoT improves zero-shot performance

02 · Combining atomic models outperforms multitask learning

03 · Bootstrapping with rejection sampling fine-tuning enhances results

Abstract

A common approach for teaching large language models (LLMs) to reason is to train on chain-of-thought (CoT) traces of in-distribution reasoning problems, but such annotated data is costly to obtain for every problem of interest. We want reasoning models to generalize beyond their training distribution, and ideally to generalize compositionally: combine atomic reasoning skills to solve harder, unseen reasoning tasks. We take a step towards compositional generalization of reasoning skills when addressing a target compositional task that has no labeled CoT data. We find that simply training models on CoT data of atomic tasks leads to limited generalization, but minimally modifying CoT formats of constituent atomic tasks to be composable can lead to improvements. We can train "atomic CoT" models on the atomic tasks with Composable CoT data and combine them with multitask learning or model…

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 4 · Confidence 3

Strengths

Turning atomic CoT data into a “composable” format is easy to implement (two tags + random‑letter prefixes) and consistently lifts zero‑shot/limited‑shot composition across tasks and two 7B bases. The construction is clearly depicted (Fig. 2), and ablations show random letters are the most robust proxy prefix out‑of‑domain.
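The "two tags + random-letter prefixes" construction mentioned above can be illustrated with a minimal sketch. The tag names (`<prefix>`, `<suffix>`), the layout, and the helper names below are assumptions made for illustration; the excerpt does not specify them, so this should not be read as the authors' exact format.

```python
import random
import string

# Hypothetical tag names: the review only says "two tags".
PREFIX_OPEN, PREFIX_CLOSE = "<prefix>", "</prefix>"
SUFFIX_OPEN, SUFFIX_CLOSE = "<suffix>", "</suffix>"


def random_letter_prefix(rng: random.Random, length: int = 16) -> str:
    """Return a string of random lowercase letters used as a proxy prefix."""
    return "".join(rng.choice(string.ascii_lowercase) for _ in range(length))


def to_composable(atomic_cot: str, rng: random.Random, length: int = 16) -> str:
    """Wrap an atomic CoT trace so it sits after a placeholder prefix.

    The random letters stand in for "some other reasoning came first",
    so the model sees its atomic CoT in a non-initial position.
    """
    proxy = random_letter_prefix(rng, length)
    return (
        f"{PREFIX_OPEN}{proxy}{PREFIX_CLOSE}"
        f"{SUFFIX_OPEN}{atomic_cot}{SUFFIX_CLOSE}"
    )


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed only to make the demo repeatable
    print(to_composable("3 + 4 = 7, so the answer is 7.", rng, length=8))
```

Under this reading, the augmentation touches only the training targets, which is consistent with the reviewer's remark that it is easy to implement.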

Weaknesses

1. "Zero-shot" claim is fragile due to validation-time merging sweeps. For Task Arithmetic, the paper sweeps $\alpha, \beta$ on a validation set for each task (App. G.4). If this validation set is the compositional task, tuning leaks target supervision into model selection and weakens the zero-shot claim; at minimum this needs to be clarified and a version without compositional validation should be reported.
2. Heavy reliance on explicit tags and random prefixes; external validity is limited. Th
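To make the first concern concrete, here is a minimal sketch of Task Arithmetic merging with a grid sweep over $\alpha, \beta$. It uses the standard formulation (merged weights = base + $\alpha\,\tau_A$ + $\beta\,\tau_B$, where each $\tau$ is a fine-tuned model minus the base); weights are plain Python lists, and the function names, the grid, and the `score` callback are hypothetical rather than taken from the paper.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Weights = Dict[str, List[float]]  # parameter name -> flat list of values


def task_vector(finetuned: Weights, base: Weights) -> Weights:
    """Task vector tau = finetuned - base, computed per parameter."""
    return {
        name: [f - b for f, b in zip(finetuned[name], base[name])]
        for name in base
    }


def merge(base: Weights, tau_a: Weights, tau_b: Weights,
          alpha: float, beta: float) -> Weights:
    """Merged weights = base + alpha * tau_a + beta * tau_b."""
    return {
        name: [b + alpha * a + beta * c
               for b, a, c in zip(base[name], tau_a[name], tau_b[name])]
        for name in base
    }


def sweep(base: Weights, tau_a: Weights, tau_b: Weights,
          grid: Sequence[float],
          score: Callable[[Weights], float]) -> Tuple[float, float, float]:
    """Return (best_score, alpha, beta) over the grid.

    `score` is a hypothetical validation metric. Whichever data it is
    computed on influences model selection, which is the leakage the
    reviewer is asking the authors to rule out.
    """
    best = None
    for alpha in grid:
        for beta in grid:
            s = score(merge(base, tau_a, tau_b, alpha, beta))
            if best is None or s > best[0]:
                best = (s, alpha, beta)
    return best
```

If `score` is evaluated on labeled examples of the compositional task, the selected $\alpha, \beta$ carry target-task supervision; scoring on atomic-task validation data, or fixing the coefficients in advance, would avoid that.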

Reviewer 02 · Rating 4 · Confidence 4

Strengths

1. The problem of compositional generalization is a core ML problem and I appreciate the authors trying to address it in the modern setting: standard LLMs have been shown to be incapable of large scale compositional reasoning. This approach is interesting and tries to leverage CoT that has been shown to help with logical/math reasoning for compositional generalization.
2. Well written and motivated empirically.
3. Strong results of the proposed approach over baselines are interesting to see, th

Weaknesses

1. The main weakness of this paper is that it doesn't experiment with compositionality enough. The tasks are also not compositional enough, and there is a risk of templating/pattern-matching hacking going on here. Symbolic manipulation of some task with quantifiable, controllable compositionality (e.g., n-digit multiplication, or multiplication of n digits k times) would be interesting to see. 2-way compositional results are interesting to study as a starter but you should not stop at 3 way compositio

Reviewer 03 · Rating 2 · Confidence 4

Strengths

The proposed method is simple and shows significant improvements on synthetic tasks.

Weaknesses

1. The paper only conducts experiments on synthetic tasks. It is unclear how the proposed method can be applied to real-world scenarios such as math reasoning or code generation. In particular, identifying the atomic tasks in these domains is non-trivial. I also doubt whether simple data augmentation, without explicitly training the model for composition, can lead to meaningful improvements on realistic tasks.
2. I am uncertain about the broader impact of this work. To advance the frontier of mode

Reviewer 04 · Rating 6 · Confidence 4

Strengths

- This study addresses an interesting and important research topic: generalization to compositional reasoning tasks using only atomic reasoning data at training time.
- The experiments demonstrate the effectiveness of the proposed method, though they are toy experiments.
- The paper is well-written and easy to follow.

Weaknesses

- I think we need an additional ablation study on why the simple trick (adding just random tags) works. For example, which is more important: the tag or the random text? What if we remove or change the tag? What if we change the style of the random text?

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Advanced Graph Neural Networks