SCREWS: A Modular Framework for Reasoning with Revisions
Kumar Shridhar, Harsh Jhamtani, Hao Fang, Benjamin Van Durme, Jason Eisner, Patrick Xia

TL;DR
SCREWS is a modular framework that enhances reasoning with LLMs by enabling diverse revision strategies and effective selection, improving performance across various reasoning tasks.
Contribution
It introduces a flexible, modular framework for reasoning with revisions, unifying previous methods and enabling novel strategies for error correction.
Findings
Heterogeneous revision strategies improve reasoning accuracy.
Selection between original and revised outputs is crucial.
Framework achieves state-of-the-art results on multiple tasks.
Abstract
Large language models (LLMs) can improve their accuracy on various tasks through iteratively refining and revising their output based on feedback. We observe that these revisions can introduce errors, in which case it is better to roll back to a previous result. Further, revisions are typically homogeneous: they use the same reasoning method that produced the initial answer, which may not correct errors. To enable exploration in this space, we present SCREWS, a modular framework for reasoning with revisions. It is comprised of three main modules: Sampling, Conditional Resampling, and Selection, each consisting of sub-modules that can be hand-selected per task. We show that SCREWS not only unifies several previous approaches under a common framework, but also reveals several novel strategies for identifying improved reasoning chains. We evaluate our framework with state-of-the-art LLMs…
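The three-module structure described in the abstract (Sampling, Conditional Resampling, Selection) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names Pipeline, first_method, second_method, always_revise, passes_check, and select_by_check are all hypothetical, and simple arithmetic functions stand in for LLM calls. The point is only to show a heterogeneous revision (a different method is used for resampling) and a selection step that can roll back to the original answer.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical type aliases for the three pluggable modules.
Sampler = Callable[[str], str]               # question -> candidate answer
ShouldRevise = Callable[[str, str], bool]    # (question, answer) -> resample?
Selector = Callable[[str, List[str]], str]   # (question, candidates) -> final answer


@dataclass
class Pipeline:
    sample: Sampler              # Sampling module
    should_revise: ShouldRevise  # condition for Conditional Resampling
    resample: Sampler            # may use a different method than `sample`
    select: Selector             # Selection module

    def run(self, question: str) -> str:
        candidates = [self.sample(question)]
        if self.should_revise(question, candidates[0]):
            candidates.append(self.resample(question))
        return self.select(question, candidates)


def first_method(question: str) -> str:
    # Toy stand-in for a flawed reasoning chain: deliberately off by one.
    a, b = (int(x) for x in question.split("+"))
    return str(a + b + 1)


def second_method(question: str) -> str:
    # Toy stand-in for a different reasoning method that gets it right.
    return str(sum(int(x) for x in question.split("+")))


def always_revise(question: str, answer: str) -> bool:
    return True


def passes_check(question: str, answer: str) -> bool:
    # Cheap partial check: the last digit of the answer must match the sum's.
    a, b = (int(x) for x in question.split("+"))
    return (a + b) % 10 == int(answer) % 10


def select_by_check(question: str, candidates: List[str]) -> str:
    # Prefer the most recent candidate that passes the check;
    # otherwise roll back to the original answer.
    for candidate in reversed(candidates):
        if passes_check(question, candidate):
            return candidate
    return candidates[0]


# The revision fixes a wrong first answer.
fixed = Pipeline(first_method, always_revise, second_method, select_by_check).run("12+30")
# The revision introduces an error, so selection rolls back to the original.
rolled_back = Pipeline(second_method, always_revise, first_method, select_by_check).run("12+30")
print(fixed, rolled_back)
```

In this toy setup both runs end with the same answer: in the first, the revised candidate replaces a faulty original; in the second, the revised candidate is rejected and the original is kept. A real instantiation would replace each callable with a prompted model or a voting scheme.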
Peer Reviews
Decision·ICLR 2024 Conference Withdrawn Submission
The paper works on an interesting problem. It collects several well-known approaches, integrates them into a single framework, and provides suggestions on how to use them. It reports results objectively and performs analysis and comparison. Regarding ideas, self-ask with respect to multiple steps of decomposition is quite interesting.
1. The paper is a collection of existing approaches; the contribution is somewhat incremental and the novelty is limited. 2. The effectiveness of the proposed approach is not yet conclusive. In Table 1, the conclusion is that sampling and conditional resampling should use different approaches, i.e. CoT + Subq (QG) or Subq (QG) + CoT. However, the improvement is rather incremental (i.e. 73 -> 73.99), especially considering that the SOTA on GSM8K is 90+ https://paperswithcode.com/sota/arithmetic-
1. The paper has touched upon a popular topic of LLM reasoning, especially when iterative revisions are needed. The proposed framework summarized the typical implementation of different modules. 2. The paper conducted experiments with different combinations of module instantiations and investigated their effectiveness. The experimental results have led to several interesting takeaway messages. 3. The paper is easy to follow.
The contribution of this paper seems to be incremental, as it is mainly an empirical exploration of existing module implementations. While the experimental results led to interesting observations, these observations are mostly expected, whereas the more critical questions, such as how to improve the existing selection method, are not well addressed.
This paper studies the problem of revisions in reasoning, including reducing errors introduced by revision and alleviating homogeneous revisions, which are important research questions for current large language model reasoning. The authors propose a unified framework to address these questions. Many previous works can be viewed as instances of the proposed framework. As a result, the framework is convenient for ablating the strategies used in the pipeline. The experiments and analyses are compreh
Please see the questions listed below.
The proposed framework is general and modular, meaning various techniques can be employed in different stages of it.
1. "A student preparing for an exam may use deductive reasoning to solve problems and inductive reasoning to verify the results": this is surely the wrong way around? 2. Figure 2 is too visually complicated to be helpful. It would be better to present a simplified and more abstract pipeline than to list every component. 3. This is the main thing I am unsure about: in Tables 1 and 2, the results are supposed to demonstrate the usefulness of the resampling strategy. However, in Table 1, only 4 out
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Software Engineering Research
