Active Prompting with Chain-of-Thought for Large Language Models
Shizhe Diao, Pengcheng Wang, Yong Lin, Rui Pan, Xiang Liu, Tong Zhang

TL;DR
This paper introduces Active-Prompt, a method that uses active learning principles to select the most informative questions for annotation, enhancing chain-of-thought prompting in large language models and achieving state-of-the-art results on reasoning tasks.
Contribution
It proposes an active learning approach for choosing which questions to annotate as chain-of-thought exemplars, improving reasoning performance over prompts built from a fixed set of human-annotated exemplars.
Findings
Achieves state-of-the-art on eight reasoning tasks
Effective question selection improves model performance
Uncertainty metrics guide optimal example annotation
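The selection step behind these findings can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the model has already been sampled k times per unlabeled question, and it scores uncertainty either by disagreement (share of distinct answers) or by the entropy of the answer distribution. The function names, the `sampled` dictionary, and the choice of these two scores are assumptions made for the example.

```python
from collections import Counter
import math


def disagreement(answers):
    """Share of distinct answers among the k sampled answers (assumed metric)."""
    return len(set(answers)) / len(answers)


def entropy(answers):
    """Shannon entropy (natural log) of the empirical answer distribution."""
    k = len(answers)
    counts = Counter(answers)
    return -sum((c / k) * math.log(c / k) for c in counts.values())


def select_most_uncertain(sampled, n, metric=disagreement):
    """Return the n questions whose sampled answers are most uncertain.

    `sampled` is a hypothetical mapping: question -> list of k model answers.
    Ties keep their original insertion order because sorted() is stable.
    """
    ranked = sorted(sampled, key=lambda q: metric(sampled[q]), reverse=True)
    return ranked[:n]


# Toy data: three questions, four sampled answers each.
sampled = {
    "q1": ["4", "4", "4", "4"],   # model is consistent
    "q2": ["7", "8", "9", "7"],   # some disagreement
    "q3": ["1", "2", "3", "4"],   # every sample differs
}
to_annotate = select_most_uncertain(sampled, n=2)
```

Under these assumptions, the questions returned in `to_annotate` are the ones a human would then annotate with chain-of-thought rationales and use as exemplars in the prompt.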
Abstract
The increasing scale of large language models (LLMs) brings emergent abilities to various complex tasks requiring reasoning, such as arithmetic and commonsense reasoning. It is known that the effective design of task-specific prompts is critical for LLMs' ability to produce high-quality answers. In particular, an effective approach for complex question-and-answer tasks is example-based prompting with chain-of-thought (CoT) reasoning, which significantly improves the performance of LLMs. However, current CoT methods rely on a fixed set of human-annotated exemplars, which are not necessarily the most effective examples for different tasks. This paper proposes a new method, Active-Prompt, to adapt LLMs to different tasks with task-specific example prompts (annotated with human-designed CoT reasoning). For this purpose, we propose a solution to the key problem of determining which questions…
Peer Reviews
Decision·Submitted to ICLR 2024
- Combining active learning with prompt construction is interesting and novel to me
- With the extensive experiments and analysis, the execution is definitely above average
- Writing is clear
- [Major] An important and very relevant baseline is missing: https://arxiv.org/abs/2210.00720. Their method is very similar to Active Prompt and simply selects the longest training instances. I would be curious to see how it compares to this work.
- [Major] One can imagine that if the model is reasonably good, the demonstrations selected by Active-Prompt will be more useful. I wonder whether this is still the case for “weaker” models. If the model does not know too much about the task, will the
The idea is straightforward and the motivation is clear. The method makes sense.
1. **Baselines are too weak, leading to a misunderstanding of the effectiveness of the proposed method.** I would like to urge the authors to include more powerful baselines in the experiment rather than hide them. ALL the reviewers are experts in this domain and familiar with the state-of-the-art performance of LLMs on these benchmarks in this domain. In the experiment section, the authors only include the CoT annotations from [1] as the most important baseline. It is widely acknowledged and st
- Overall the paper is written clearly and proposes an approach for example selection for chain-of-thought prompting. The method uses existing approaches from active learning and shows improvements over baselines.
- The authors evaluate their approach on a range of mathematical and commonsense reasoning tasks, and conduct ablations to understand the effect of different factors.
- The approach seems to have limited applicability, as it requires a large enough dataset to sample from, either for the task itself or for a similar task. The authors also report variations between different annotators, further attesting to the difficulty of the task.
- Some details in the paper are missing. For example, how is the variance-based approach applied to textual answers? No results are presented for the self-confidence approach; only an example is given.
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Machine Learning and Algorithms
