In-Context Learning with Iterative Demonstration Selection
Chengwei Qin, Aston Zhang, Chen Chen, Anirudh Dagar, Wenming Ye

TL;DR
This paper introduces Iterative Demonstration Selection (IDS), a method that improves in-context learning by iteratively selecting demonstrations based on both diversity and similarity, leading to better performance across various tasks.
Contribution
The paper proposes IDS, a novel iterative demonstration selection method that combines diversity and similarity for improved in-context learning performance.
Findings
IDS outperforms existing demonstration selection methods.
IDS improves performance on reasoning, question answering, and topic classification tasks.
Iterative selection enhances the relevance and diversity of demonstrations.
Abstract
Spurred by advancements in scale, large language models (LLMs) have demonstrated strong few-shot learning ability via in-context learning (ICL). However, the performance of ICL has been shown to be highly sensitive to the selection of few-shot demonstrations. Selecting the most suitable examples as context remains an ongoing challenge and an open problem. Existing literature has highlighted the importance of selecting examples that are diverse or semantically similar to the test sample while ignoring the fact that the optimal selection dimension, i.e., diversity or similarity, is task-specific. Based on how the test sample is answered, we propose Iterative Demonstration Selection (IDS) to leverage the merits of both dimensions. Using zero-shot chain-of-thought reasoning (Zero-shot-CoT), IDS iteratively selects examples that are diverse but still strongly correlated with the test sample…
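The loop the abstract describes can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the reviews below describe IDS as running Zero-shot-CoT on the test sample, using the resulting reasoning path to pick demonstrations by cosine similarity against the training set, repeating for several iterations, and taking a majority vote. The function names (`embed`, `top_k_similar`, `ids_predict`), the bag-of-words embedding, the choice to embed only the training questions, and the two LLM callables are all hypothetical stand-ins introduced here.

```python
import math
import re
from collections import Counter
from typing import Callable, List, Tuple

Example = Tuple[str, str]          # (question, answer) pair from the training set
LLMOutput = Tuple[str, str]        # (reasoning path, answer)


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; an assumption standing in for a sentence encoder."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def top_k_similar(reasoning: str, train_set: List[Example], k: int) -> List[Example]:
    """Pick the k training examples whose question is closest to the reasoning path."""
    query = embed(reasoning)
    ranked = sorted(train_set, key=lambda ex: -cosine(query, embed(ex[0])))
    return ranked[:k]


def ids_predict(
    question: str,
    train_set: List[Example],
    zero_shot_cot: Callable[[str], LLMOutput],                    # hypothetical LLM call
    answer_with_demos: Callable[[List[Example], str], LLMOutput], # hypothetical LLM call
    k: int = 2,
    iterations: int = 3,
) -> str:
    """Iterate: reasoning path -> demonstrations -> new reasoning path, then vote."""
    reasoning, _ = zero_shot_cot(question)        # initial Zero-shot-CoT reasoning path
    answers: List[str] = []
    for _ in range(iterations):
        demos = top_k_similar(reasoning, train_set, k)
        reasoning, answer = answer_with_demos(demos, question)
        answers.append(answer)
    # Majority vote over the answers from all iterations
    return Counter(answers).most_common(1)[0][0]
```

Because each iteration produces a new reasoning path, the selected demonstrations can change from one round to the next; in this sketch that is the only source of variation, and any stopping rule or prompt format the paper may use is not modeled.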
Peer Reviews
Decision: ICLR 2024 Conference Withdrawn Submission
1. The paper is well-written, and the proposed method IDS is easy to follow.
2. This paper provides a conclusion that both similarity and diversity are important in example selection in ICL scenarios, which can help other researchers who are working on a similar area.

1. Experiments are conducted based on simple tasks such as classification, commonsense reasoning, etc., on which the improvements seem marginal. More complex and difficult generative tasks, including mathematical reasoning, QA, and machine translation, are encouraged to be adapted.
2. The conclusion that “it is unreasonable to claim that one dimension is consistently better than the other across different tasks” is drawn through only two datasets, AGNews and CommonsenseQA, which is not that soli

- The paper is well motivated with preliminary experimental results.
- The literature review is good.

- Overall, the presentation of the paper should be improved, and the technical novelty is limited. In particular, Figure 2 should be carefully polished.
- The explanation of why IDS can incorporate diversity should be made clear. I notice the argument "they can be different during iterations to ensure diversity because the reasoning paths vary in different iterations", but how can you ensure such diversity? Purely rely on the randomness in LLM sampling?
- It seems that the evaluation tasks are r

1. The methodology is well-explained, with IDS applying Zero-shot-CoT to the test sample before demonstration selection. The output reasoning path is iteratively used to choose demonstrations that are prepended to the test sample for inference. After several iterations, IDS adopts majority voting to obtain the final result.
2. Experiments on various tasks and thorough analysis on hyper-parameters (e.g., number of demonstrations and number of iterations) demonstrate the effectiveness of IDS.

1. Lack of comparison with stronger baselines. Much related work and methods for in-context example selection (e.g., EPR [1], TST[2], CEIL[3], Skill-KNN[4]) are not experimentally compared (or even not mentioned). At least some of them should appear in the experiments part.
2. Actually, this work is not "the first time consider both the diversity and similarity dimensions of ICL demonstration selection for LLMs". For instance, [5] use MMR that considers both similarity and diversity. [6] also d

* The paper is written in a clear manner.
* There is an experiment demonstrating the need to consider diversity and similarity in a task-specific manner.
* There are several analyses regarding the number of demonstrations, the number of iterations, and model types.

* The experimental results look somewhat marginal.
* The performed experiments are too limited in scope. They were conducted only on CommonsenseQA, BoolQ, AGNews, DBPedia, SST2, and Amazon, which undermines the robustness of the methodology. Results from a more diverse set of tasks are needed e.g., not the classification tasks.
* Measuring cosine similarity between the reasoning path R and the training set lacks some persuasiveness as a criterion for selecting few-shot samples. If there are the
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
