Chain-of-Thought Reasoning is a Policy Improvement Operator

Hugh Zhang; David C. Parkes

arXiv:2309.08589 · cs.LG · November 9, 2023 · 2 cites

TL;DR

This paper demonstrates that language models can self-improve their reasoning skills through a self-education loop using chain-of-thought reasoning, enabling them to solve more complex problems without extensive human-labeled data.

Contribution

Introducing SECToR, a self-learning framework where language models improve their problem-solving abilities via chain-of-thought reasoning without relying on large amounts of human-labeled data.

Findings

1. Models autonomously learn to add long-digit numbers.
2. Self-improvement occurs through iterative reasoning and training.
3. Models outperform initial capabilities after self-education loop.

Abstract

Large language models have astounded the world with fascinating new capabilities. However, they currently lack the ability to teach themselves new skills, relying instead on large amounts of human-generated training data. We introduce SECToR (Self-Education via Chain-of-Thought Reasoning), a proof-of-concept demonstration that language models can teach themselves new skills using chain-of-thought reasoning. During the self-learning loop, SECToR asks models to solve addition problems using chain-of-thought reasoning before training the next version of the model to solve those same problems directly without using such reasoning. This process often results in an improved model which can, when again augmented with chain-of-thought reasoning, solve even harder problems than the original model, allowing the self-learning loop to continue. Language models trained via SECToR autonomously learn…
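The loop described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: `ToyModel`, `answer_with_reasoning`, and `self_education_round` are hypothetical names, the "model" is a plain Python object whose direct-answer ability is capped at a fixed number of digits, and "training" is faked by raising that cap. The sketch only aims to show the shape of the loop: solve harder problems slowly by reducing them to pieces the current model can answer directly, then produce a next model that answers those problems directly.

```python
import random


class ToyModel:
    """Hypothetical stand-in for a language model (not the paper's model)."""

    def __init__(self, max_digits):
        # Longest operand, in digits, that the model can add without reasoning.
        self.max_digits = max_digits

    def answer_directly(self, a, b):
        """Fast answer with no reasoning; returns None if an operand is too long."""
        if max(len(str(a)), len(str(b))) > self.max_digits:
            return None
        return a + b


def answer_with_reasoning(model, a, b):
    """Slow answer: split off the last digit and solve the smaller pieces directly."""
    high_a, low_a = divmod(a, 10)
    high_b, low_b = divmod(b, 10)
    low_sum = model.answer_directly(low_a, low_b)
    high_sum = model.answer_directly(high_a, high_b)
    if low_sum is None or high_sum is None:
        return None
    carry, low_digit = divmod(low_sum, 10)
    # Assumption: the scaffold, not the model, recombines the pieces.
    return (high_sum + carry) * 10 + low_digit


def self_education_round(model, n_problems, rng):
    """One loop iteration: solve harder problems slowly, then 'train' on them."""
    target = model.max_digits + 1
    solved = []
    for _ in range(n_problems):
        a = rng.randrange(10 ** (target - 1), 10 ** target)
        b = rng.randrange(10 ** (target - 1), 10 ** target)
        answer = answer_with_reasoning(model, a, b)
        if answer is not None:
            solved.append((a, b, answer))
    # Training is faked here: a real system would fine-tune on `solved`.
    return ToyModel(target) if solved else model


model = ToyModel(max_digits=1)
rng = random.Random(0)
for _ in range(3):
    model = self_education_round(model, n_problems=20, rng=rng)
print(model.max_digits)  # prints 4
```

Each round extends the direct-answer range by one digit, so three rounds starting from one-digit addition reach four digits. In a real setting the training step is imperfect, which is why the abstract says the process "often" yields an improved model; this toy version omits that uncertainty.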

Peer Reviews

Decision · Submitted to ICLR 2024

Reviewer 01 · Rating 5 (marginally below the acceptance threshold) · Confidence 4

Strengths

- The motivation for this paper is sound and strong. Self-improvement is likely to become an increasingly important area of research as data sources for training large models become exhausted.
- The results are well-presented, easily interpretable and improve on the current state-of-the-art in the particular domain under consideration.

Weaknesses

- There are some missing citations, e.g. in the self-improvement of LLMs space (https://arxiv.org/abs/2309.03409, https://arxiv.org/abs/2211.01910, https://arxiv.org/abs/2309.16797) and in the fine-tuning of LLMs via prompted LLMs space (https://arxiv.org/abs/2212.08410, https://arxiv.org/abs/2212.10071).
- My main concern is that this paper only demonstrates the benefit of the proposed method on a single toy domain: addition of numbers with many digits. Therefore, it is hard to assess whether

Reviewer 02 · Rating 3 (reject, not good enough) · Confidence 4

Strengths

The article introduces the SECToR method, which achieves self-learning through chain-of-thought reasoning. This approach provides a new avenue for autonomous learning in language models.

Weaknesses

1. This article lacks specific details regarding the dataset used, including the sizes of the training and testing sets, as well as the methodology for constructing the dataset (including both fast and slow datasets).
2. Figure 4 and the simplify-then-guess process are rather confusing. Can the authors provide a more detailed explanation or illustrative example of the simplify-then-guess process? (See questions 1 and 2.)
3. The article should include a comparative experiment, specifically training t

Reviewer 03 · Rating 3 (reject, not good enough) · Confidence 3

Strengths

1. Linking CoT with Monte-Carlo Tree Search is an interesting and, as far as I know, novel idea. Exploring the direction of optimizing language models through self-training shows promise for model training.
2. The authors present intriguing observations on how error accumulation prevention enhances self-learning.
3. The paper is easy to follow.

Weaknesses

1. Although the proposed method demonstrates strong self-learning abilities in synthetic addition tasks, I generally perceive it as being quite specific to those tasks. More specifically, the simplify-then-guess approach and curriculum appear to necessitate a distinct hierarchical structure that can be solved recursively.
2. The experimental results are not enough to confirm the feasibility of the proposed method. The absence of baseline methods raises concerns about whether the proposed method

Taxonomy

Topics: Topic Modeling

Methods: Self-Learning · AlphaZero · Monte-Carlo Tree Search