InSQuaD: In-Context Learning for Efficient Retrieval via Submodular Mutual Information to Enforce Quality and Diversity
Souradeep Nanda, Anay Majee, Rishabh Iyer

TL;DR
InSQuaD enhances in-context learning by using submodular mutual information to select high-quality, diverse exemplars, improving retrieval and performance across multiple datasets.
Contribution
The paper introduces a novel SMI-based selection strategy and training paradigm for better quality and diversity in in-context exemplars for ICL models.
Findings
Significant performance improvements on nine benchmark datasets.
Effective modeling of relevance and diversity in exemplar retrieval.
Enhanced ICL performance through synthetic paraphrase augmentation.
Abstract
In this paper, we introduce InSQuaD, designed to enhance the performance of In-Context Learning (ICL) models through Submodular Mutual Information (SMI) enforcing Quality and Diversity among in-context exemplars. InSQuaD achieves this through two principal strategies: First, we model the ICL task as a targeted selection problem and introduce a unified selection strategy based on SMIs which mines relevant yet diverse in-context examples encapsulating the notions of quality and diversity. Secondly, we address a common pitfall in existing retrieval models which model query relevance, often overlooking diversity, critical for ICL. InSQuaD introduces a combinatorial training paradigm which learns the parameters of an SMI function to enforce both quality and diversity in the retrieval model through a novel likelihood-based loss. To further aid the learning process we augment an existing…
| Instance Name | Instance of the quality loss | Instance of the diversity loss |
|---|---|---|
| InSQuaD-GC | | |
| InSQuaD-FL | | |
| InSQuaD-LD | | |
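| Method | MRPC (Classification) | SST5 (Classification) | MNLI (Classification) | DBpedia (Classification) | RTE (Classification) | HellaSwag (Multi-Choice) | MWoZ (Dialogue) | GeoQ (Generation) | Xsum (Generation) | Avg. Rank | Avg. Perf. |
|---|---|---|---|---|---|---|---|---|---|---|---|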
| Zeroshot | 0.28±0.02 | 0.26±0.02 | 0.41±0.01 | 0.52±0.02 | 0.55±0.02 | 0.19±0.01 | 0.07±0.00 | 0.63±0.01 | 0.18±0.00 | 12.2 | 0.34 |
| Random | 0.51±0.02 | 0.42±0.01 | 0.37±0.02 | 0.57±0.00 | 0.52±0.04 | 0.27±0.01 | 0.27±0.01 | 0.82±0.09 | 0.23±0.02 | 8.9 | 0.44 |
| Diversity | 0.45±0.02 | 0.42±0.03 | 0.39±0.03 | 0.59±0.04 | 0.55±0.03 | 0.24±0.01 | 0.16±0.01 | 0.89±0.03 | 0.23±0.00 | 8.6 | 0.44 |
| Least Confidence | 0.56±0.03 | 0.33±0.02 | 0.32±0.02 | 0.45±0.00 | 0.57±0.04 | 0.27±0.02 | 0.13±0.00 | 0.86±0.01 | 0.23±0.01 | 9.4 | 0.41 |
| MFL | 0.51±0.01 | 0.43±0.02 | 0.32±0.01 | 0.62±0.02 | 0.52±0.02 | 0.29±0.01 | 0.39±0.04 | 0.89±0.03 | 0.12±0.00 | 7.6 | 0.46 |
| GC | 0.47±0.01 | 0.38±0.02 | 0.35±0.01 | 0.60±0.01 | 0.51±0.03 | 0.21±0.01 | 0.18±0.01 | 0.88±0.01 | 0.24±0.00 | 10.8 | 0.42 |
| Vote-K | 0.47±0.01 | 0.40±0.01 | 0.33±0.02 | 0.63±0.01 | 0.52±0.01 | 0.25±0.02 | 0.33±0.02 | 0.89±0.02 | 0.18±0.01 | 9.9 | 0.44 |
| IDEAL | 0.47±0.02 | 0.42±0.01 | 0.35±0.01 | 0.62±0.02 | 0.54±0.01 | 0.26±0.00 | 0.36±0.01 | 0.82±0.07 | 0.19±0.01 | 8.9 | 0.45 |
| InSQuaD-FL (NT) | 0.50±0.00 | 0.40±0.04 | 0.32±0.01 | 0.66±0.05 | 0.52±0.01 | 0.28±0.01 | 0.30±0.02 | 0.80±0.04 | 0.10±0.01 | 10.3 | 0.43 |
| InSQuaD-LD (NT) | 0.47±0.03 | 0.38±0.03 | 0.37±0.01 | 0.68±0.02 | 0.55±0.02 | 0.25±0.04 | 0.39±0.00 | 0.83±0.01 | 0.22±0.03 | 8.7 | 0.46 |
| InSQuaD-GC (NT) | 0.57±0.01 | 0.41±0.03 | 0.34±0.02 | 0.65±0.03 | 0.52±0.01 | 0.27±0.02 | 0.34±0.02 | 0.84±0.00 | 0.19±0.01 | 8.4 | 0.46 |
| InSQuaD-FL | 0.50±0.02 | 0.42±0.02 | 0.39±0.02 | 0.65±0.01 | 0.56±0.00 | 0.27±0.02 | 0.34±0.03 | 0.89±0.11 | 0.24±0.01 | 5.7 | 0.47 |
| InSQuaD-LD | 0.49±0.02 | 0.40±0.00 | 0.35±0.02 | 0.67±0.07 | 0.62±0.03 | 0.27±0.02 | 0.40±0.01 | 0.89±0.01 | 0.23±0.03 | 6.1 | 0.48 |
| InSQuaD-GC | 0.58±0.03 | 0.43±0.05 | 0.44±0.02 | 0.68±0.02 | 0.57±0.02 | 0.29±0.01 | 0.40±0.01 | 0.85±0.03 | 0.23±0.01 | 3.6 | 0.50 |
| Oracle | 0.68±0.03 | 0.49±0.02 | 0.64±0.01 | 0.83±0.08 | 0.71±0.05 | 0.59±0.04 | 0.53±0.03 | 0.98±0.04 | 0.31±0.02 | 1.0 | 0.64 |
| Hyperparameter | Value |
|---|---|
| Epochs | 7 |
| Batch Size | 32 |
| Learning Rate | 3e-5 |
| Weight Decay | 0.01 |
| Learning Rate Decay Strategy | linear |
| Warmup Ratio | 0.06 |
| Optimizer | AdamW |
| Method | Quality-diversity trade-off | MRPC (Classification) | SST5 (Classification) | MNLI (Classification) | DBpedia (Classification) | RTE (Classification) | HellaSwag (Multi-Choice) | MWoZ (Dialogue) | GeoQ (Generation) | Xsum (Generation) |
|---|---|---|---|---|---|---|---|---|---|---|
| Zeroshot | - | 0.28±0.02 | 0.26±0.02 | 0.41±0.01 | 0.52±0.02 | 0.55±0.02 | 0.19±0.01 | 0.07±0.00 | 0.63±0.01 | 0.18±0.00 |
| Random | - | 0.51±0.02 | 0.42±0.01 | 0.37±0.02 | 0.57±0.00 | 0.52±0.04 | 0.27±0.01 | 0.27±0.01 | 0.82±0.09 | 0.23±0.02 |
| InSQuaD-FL | 0 | 0.50±0.02 | 0.41±0.04 | 0.38±0.02 | 0.65±0.01 | 0.56±0.00 | 0.21±0.02 | 0.34±0.03 | 0.89±0.11 | 0.23±0.00 |
| InSQuaD-GC | 0 | 0.58±0.03 | 0.40±0.01 | 0.37±0.02 | 0.63±0.03 | 0.47±0.04 | 0.24±0.02 | 0.40±0.01 | 0.81±0.03 | 0.22±0.02 |
| InSQuaD-LD | 0 | 0.46±0.05 | 0.38±0.03 | 0.32±0.01 | 0.67±0.07 | 0.53±0.04 | 0.26±0.02 | 0.38±0.02 | 0.86±0.07 | 0.23±0.01 |
| InSQuaD-FL | 0.25 | 0.38±0.02 | 0.42±0.02 | 0.32±0.02 | 0.59±0.04 | 0.53±0.04 | 0.23±0.01 | 0.30±0.03 | 0.89±0.04 | 0.16±0.01 |
| InSQuaD-GC | 0.25 | 0.51±0.01 | 0.40±0.03 | 0.44±0.02 | 0.52±0.02 | 0.54±0.02 | 0.24±0.01 | 0.35±0.05 | 0.84±0.03 | 0.23±0.01 |
| InSQuaD-LD | 0.25 | 0.49±0.02 | 0.40±0.01 | 0.31±0.01 | 0.65±0.01 | 0.58±0.01 | 0.25±0.03 | 0.37±0.01 | 0.86±0.00 | 0.21±0.02 |
| InSQuaD-FL | 0.5 | 0.39±0.04 | 0.41±0.03 | 0.35±0.03 | 0.63±0.03 | 0.53±0.00 | 0.27±0.02 | 0.31±0.02 | 0.80±0.06 | 0.24±0.01 |
| InSQuaD-GC | 0.5 | 0.48±0.04 | 0.33±0.02 | 0.33±0.02 | 0.63±0.05 | 0.54±0.01 | 0.29±0.01 | 0.25±0.02 | 0.79±0.01 | 0.22±0.03 |
| InSQuaD-LD | 0.5 | 0.44±0.03 | 0.38±0.03 | 0.35±0.02 | 0.60±0.05 | 0.54±0.04 | 0.27±0.02 | 0.40±0.01 | 0.82±0.03 | 0.22±0.02 |
| InSQuaD-FL | 1 | 0.42±0.03 | 0.39±0.04 | 0.39±0.02 | 0.43±0.02 | 0.51±0.05 | 0.24±0.04 | 0.25±0.01 | 0.87±0.01 | 0.24±0.01 |
| InSQuaD-GC | 1 | 0.38±0.05 | 0.43±0.05 | 0.41±0.02 | 0.68±0.02 | 0.57±0.02 | 0.25±0.02 | 0.25±0.01 | 0.85±0.03 | 0.21±0.02 |
| InSQuaD-LD | 1 | 0.44±0.02 | 0.40±0.00 | 0.30±0.02 | 0.67±0.05 | 0.62±0.03 | 0.25±0.02 | 0.37±0.04 | 0.89±0.01 | 0.23±0.03 |
| Oracle | - | 0.68±0.03 | 0.49±0.02 | 0.64±0.01 | 0.83±0.08 | 0.71±0.05 | 0.59±0.04 | 0.53±0.03 | 0.98±0.04 | 0.31±0.02 |
InSQuaD: In-Context Learning for Efficient Retrieval via Submodular Mutual Information to Enforce Quality and Diversity
Souradeep Nanda*, Anay Majee*, Rishabh Iyer
Computer Science, The University of Texas at Dallas, Dallas, USA
*Equal contribution.
Abstract
In this paper, we introduce InSQuaD (code released at https://github.com/Ghost---Shadow/InSQuaD), designed to enhance the performance of In-Context Learning (ICL) models through Submodular Mutual Information (SMI) enforcing Quality and Diversity among in-context exemplars. InSQuaD achieves this through two principal strategies: First, we model the ICL task as a targeted selection problem and introduce a unified selection strategy based on SMIs which mines relevant yet diverse in-context examples encapsulating the notions of quality and diversity. Secondly, we address a common pitfall in existing retrieval models which model query relevance, often overlooking diversity, critical for ICL. InSQuaD introduces a combinatorial training paradigm which learns the parameters of an SMI function to enforce both quality and diversity in the retrieval model through a novel likelihood-based loss. To further aid the learning process we augment an existing multi-hop question answering dataset with synthetically generated paraphrases. Adopting the retrieval model trained using this strategy alongside the novel targeted selection formulation for ICL on nine benchmark datasets shows significant improvements, validating the efficacy of our approach.
Index Terms:
In-context learning, subset selection, Submodular functions, Information Retrieval
I Introduction
In-Context Learning (ICL) [1, 2, 3] has soared to prominence in NLP by cutting down on task-specific training and enabling swift generalization to downstream tasks. It accomplishes this by injecting few-shot prompts, hereafter called exemplars or demonstrations, into the context of a Large Language Model (LLM) to guide it at test time. Following prior works [4, 5, 6], selecting these exemplars unfolds in two stages. First, we shortlist diverse exemplars from a large unlabeled corpus for human annotation (Exemplar Annotation). Next, we retrieve a small subset from this newly annotated pool that is highly relevant to the input query (Exemplar Retrieval). This subset, bundled with the query, is then presented to a pretrained LLM, which produces the desired solution.
Several recent works [6, 2] underscore that the success of ICL hinges on the quality and diversity of in-context exemplars and employ selective annotation [7] to lighten the burden of labeling massive datasets by focusing on query relevance (quality), oftentimes neglecting diversity among selected exemplars. Consequently, several works adopt combinatorial techniques [5, 4] modeling ICL as a selection/summarization task, identifying prompts most pertinent to the input query. However, all selection techniques operate on a constant budget, risking the exclusion of diverse yet relevant in-context examples. Meanwhile, techniques [8, 9] that learn the parameters of the underlying retrieval model prioritize query relevance but frequently overlook an explicit treatment of both quality and diversity in the learned model.
To this end, we introduce InSQuaD, a unified combinatorial approach for In-Context Learning. Drawing on Submodular Mutual Information (SMI) functions [10, 11], InSQuaD naturally incorporates Quality, Diversity, and Order among exemplars in two phases: InSQuaD-RETRIEVE and InSQuaD-LEARN as shown in Figure 1. First, InSQuaD-RETRIEVE views selective annotation as a targeted selection problem applied in two distinct stages as discussed above. During Exemplar Annotation, diversity is enforced by maximizing the SMI among an unlabeled pool of exemplars, identifying distinct examples for annotation. Exemplar Retrieval extracts the most relevant yet diverse examples for answering an input query at runtime. Unlike existing works which either model query relevance [12, 9, 6] or employ a multi-stage approach [4] to model quality and diversity, our SMI based targeted selection strategy jointly models both traits while selecting in-context exemplars. Further, adopting greedy optimization [13] in maximizing SMI models an implicit ordering of exemplars by their incremental submodular information gains. Thus, InSQuaD elegantly weaves together all three pillars: quality, diversity, and order.
Secondly, we note that current methods rely on embeddings from pretrained retrieval models [14] that prioritize query relevance but overlook diversity. InSQuaD-LEARN addresses this gap by introducing a family of likelihood-based loss functions (Table I) to train a retrieval model that learns the parameters of an SMI function, leveraging the property that submodular functions capture cooperation [15] when minimized and diversity [16] when maximized. This formulation thus aligns with the goal of the downstream ICL tasks, which InSQuaD represents as a selection problem. Finally, we facilitate the training of the underlying retrieval models by introducing a novel dataset, augmenting existing multi-hop question answering datasets like HotpotQA [17] with synthetic documents (Section III-C2), which aids the capture of quality (contextual similarity) and diversity (paraphrased distractors) in the learnt representations.
The main contributions of InSQuaD can be summarized as follows:
- InSQuaD introduces a novel combinatorial targeted selection strategy for selecting in-context prompts for ICL based on SMI functions, inherently modeling the notions of quality and diversity.
- InSQuaD introduces a novel training paradigm for fine-tuning existing sentence embedding models to inherently model quality and diversity, a necessity for downstream ICL tasks, by learning the parameters of a likelihood-based SMI objective.
- Applying both InSQuaD-RETRIEVE and InSQuaD-LEARN shows improvements of up to 21.6% on classification tasks, 16.4% on multi-choice tasks and up to 7% on generation-based ICL tasks (refer Sec. IV-C) over existing baselines.
II Related Work
In-Context Learning: The exploration of in-context learning has significantly advanced our understanding of large language model (LLM) capabilities. Active example selection has been highlighted as a crucial factor for enhancing in-context learning, with [1], [2], [18] and [3] providing insights into the process. The role of demonstrations in in-context learning effectiveness has been examined by [19] and [20], emphasizing the impact of demonstration quality. Theoretical underpinnings of in-context learning are proposed by [21] and [22], discussing mechanisms like implicit Bayesian inference and structure induction. Research by [23], [24], [25] and [26] explores the scalability of in-context learning, while the function of induction heads in this paradigm is investigated by [27].
Language Model and Learning Strategies: The combination of active learning strategies with the transformer architecture has paved the way for significant progress in the field. The works of [28] and [29] exemplify advancements in active learning for transformers. The seminal paper by [30], introducing the transformer model, marks a cornerstone in language model research. Further exploration into the learning mechanisms of transformers is provided by [31] and [32]. The benchmark leakage analysis of [33] offers further insights. The broader implications of foundation models are discussed by [34], complemented by surveys and studies on the emergent abilities and challenges in scaling language models [35, 36, 37]. Lastly, the importance of data distributional properties in emergent in-context learning is highlighted by [38].
III Method
III-A Problem Formulation and Notations
Given a search query and a set of relevant (in-context) demonstrations, an In-Context Learner (ICL) aims to generate a solution from an LLM with fixed parameters as:
[TABLE]
The in-context demonstrations (also known as exemplars) are mined from an unlabeled pool of documents in a few-shot ($k$-shot) fashion, based on their relevance to the test query. Here, the query and each in-context exemplar are formatted using a common template, while the output is verbalized using the verbalizer defined in [8]. Generation proceeds without updating the parameters of the LLM and depends solely on the relevance and diversity of the in-context demonstrations with respect to the query.
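The prompt assembly described above can be sketched as follows. This is a minimal illustration, not the template or verbalizer of [8]: the names `apply_template`, `build_prompt` and `verbalize`, the `Input:`/`Output:` layout, and the sample sentences are all assumptions made for the example. The call to the frozen LLM is left out.

```python
from typing import Dict, List, Optional, Tuple

Exemplar = Tuple[str, str]  # (input text, gold output); a hypothetical layout


def apply_template(text: str, output: str = "") -> str:
    """Hypothetical common template shared by exemplars and the query."""
    return f"Input: {text}\nOutput: {output}".rstrip()


def build_prompt(exemplars: List[Exemplar], query: str) -> str:
    """Concatenate templated exemplars, then the templated query with an empty output slot."""
    blocks = [apply_template(text, output) for text, output in exemplars]
    blocks.append(apply_template(query))
    return "\n\n".join(blocks)


def verbalize(generation: str, label_words: Dict[str, str]) -> Optional[str]:
    """Map the first generated word to a task label, or None if it is not a known label word."""
    words = generation.strip().split()
    if not words:
        return None
    return label_words.get(words[0].strip(".,!?").lower())


prompt = build_prompt(
    [("A man is playing guitar. / A person plays a guitar.", "yes"),
     ("It is raining. / The sun is out.", "no")],
    "The cat sleeps. / A cat is sleeping.",
)
```

The exemplars keep the order in which they were retrieved, so any ordering produced by the selection step carries through to the prompt.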
III-B Preliminaries: Submodularity
Submodular functions are set functions characterized by a natural diminishing returns property. Specifically, a set function $f: 2^{\mathcal{V}} \rightarrow \mathbb{R}$, defined over a ground set $\mathcal{V}$, is submodular if it satisfies the inequality $f(\mathcal{A}) + f(\mathcal{B}) \geq f(\mathcal{A} \cup \mathcal{B}) + f(\mathcal{A} \cap \mathcal{B})$ for all subsets $\mathcal{A}, \mathcal{B} \subseteq \mathcal{V}$ [39]. These functions have been extensively studied for data subset selection [40, 11, 41, 42], video summarization [43, 44], etc. Traditionally, these tasks are modeled as a discrete optimization problem through submodular maximization under a knapsack constraint [13]. This can be approximated with a constant-factor guarantee using greedy optimization techniques [45]. In our paper we study two popular categories of submodular functions: Submodular Information Functions (SIMs) and Submodular Mutual Information Functions (SMIs). Maximizing a SIM [39] such as Facility-Location or Graph-Cut promotes the selection of diverse examples within a set $\mathcal{A}$, while maximizing an SMI function $I_f(\mathcal{A}; \mathcal{Q}) = f(\mathcal{A}) + f(\mathcal{Q}) - f(\mathcal{A} \cup \mathcal{Q})$ (given the underlying submodular function $f$) selects examples in $\mathcal{A}$ that share maximum information with a query set $\mathcal{Q}$. A special class of submodular functions, Submodular Point Processes (SPPs) [46], presents a probabilistic likelihood formulation that has been used to learn the parameters of an underlying combinatorial function for summarization tasks. Very recently, combinatorial functions have been applied as learning objectives in continuous optimization problems such as long-tail recognition [47] and few-shot object detection [48] for representation learning. Although several ICL approaches (refer Sec. II) adopt combinatorial techniques, we present a unified method incorporating all three notions of quality, diversity and order, whose joint treatment is a known gap in the literature.
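To make the two definitions concrete, the sketch below evaluates a facility-location function on a toy similarity matrix, checks the diminishing-returns property on one pair of nested sets, and computes a submodular mutual information value through the standard identity f(A) + f(Q) - f(A ∪ Q). The matrix values and helper names are illustrative assumptions, not taken from the paper.

```python
from typing import Callable, FrozenSet, Sequence

Matrix = Sequence[Sequence[float]]
SetFn = Callable[[FrozenSet[int]], float]


def make_facility_location(sim: Matrix) -> SetFn:
    """Return f(A) = sum over ground-set items i of max_{j in A} sim[i][j], with f(empty) = 0."""
    def f(subset: FrozenSet[int]) -> float:
        if not subset:
            return 0.0
        return sum(max(row[j] for j in subset) for row in sim)
    return f


def marginal_gain(f: SetFn, subset: FrozenSet[int], item: int) -> float:
    """Gain in f from adding one item to a subset."""
    return f(subset | {item}) - f(subset)


def smi(f: SetFn, a: FrozenSet[int], q: FrozenSet[int]) -> float:
    """Submodular mutual information I_f(A; Q) = f(A) + f(Q) - f(A u Q)."""
    return f(a) + f(q) - f(a | q)


# Toy similarity kernel over a ground set of three items (illustrative values).
SIM = [[1.0, 0.8, 0.1],
       [0.8, 1.0, 0.2],
       [0.1, 0.2, 1.0]]
f = make_facility_location(SIM)

small, large = frozenset({0}), frozenset({0, 1})
gain_small = marginal_gain(f, small, 2)  # adding item 2 to the smaller set
gain_large = marginal_gain(f, large, 2)  # adding item 2 to the larger set
```

Here the gain from adding item 2 shrinks as the base set grows, which is the diminishing-returns behaviour the definition describes.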
III-C InSQuaD
In this section we outline the details of our proposed InSQuaD method, which unifies the notions of quality and diversity in a novel combinatorial formulation. InSQuaD operates in two distinct stages. (1) InSQuaD-RETRIEVE shortlists a small yet diverse set of exemplars from an unlabeled pool (exemplar annotation) and then selects the most relevant yet diverse exemplars (exemplar retrieval) for downstream ICL tasks. In contrast to other approaches in the literature (refer Sec. II), which adopt separate functions for each of these tasks, InSQuaD relies solely on Submodular Mutual Information (SMI) functions, modeling both tasks as a targeted selection problem, as discussed in Sec. III-C1. (2) InSQuaD-LEARN trains the retrieval model used as the embedding generator during selection, learning its parameters to encapsulate both quality and diversity, unlike popularly used models [14], which capture only query relevance. InSQuaD achieves this through a novel combinatorial loss formulation, also based on SMI functions, described in Sec. III-C2. The trained weights of the retrieval model are consumed directly by InSQuaD-RETRIEVE, where the model serves as a frozen embedding generator during both the annotation and retrieval stages.
III-C1 InSQuaD-RETRIEVE
A core task in ICL is to mine diverse yet relevant in-context examples required to answer a query $Q$. We model this task as a two-stage targeted selection problem using SMI functions ($I_f$): Exemplar Annotation followed by Exemplar Retrieval. The Exemplar Annotation stage circumvents the challenge of annotating the complete unlabeled set $\mathcal{U}$ by selecting a diverse yet representative subset of examples $\mathcal{A} \subseteq \mathcal{U}$. We achieve this by maximizing the SMI over the unlabeled set, as shown in eq. 2.
$$\mathcal{A}^{*} = \underset{\mathcal{A} \subseteq \mathcal{U},\; |\mathcal{A}| \leq B}{\arg\max}\; I_f(\mathcal{A}; \mathcal{U}) \tag{2}$$
Here, $B$ is the annotation budget. Note that since the query set in the SMI function is the complete ground set $\mathcal{U}$, this formulation reduces to maximizing the submodular function $f$ over the unlabeled set, because $I_f(\mathcal{A}; \mathcal{U}) = f(\mathcal{A}) + f(\mathcal{U}) - f(\mathcal{A} \cup \mathcal{U}) = f(\mathcal{A})$, as shown in [49]. Following the observations in [50], maximizing the submodular information over a subset models diversity among the selected examples, agnostic of the test query. The exemplars in $\mathcal{A}^{*}$ are then labeled by human annotators to produce a labeled set $\mathcal{L}$.
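The annotation stage can be sketched as a budgeted greedy maximization of a submodular function over the unlabeled pool. The example below uses facility location as the function and a hand-written similarity matrix with two loose clusters. Both are assumptions for illustration, as is the helper name `greedy_annotate`.

```python
from typing import List, Sequence

Matrix = Sequence[Sequence[float]]


def facility_location(subset: List[int], sim: Matrix) -> float:
    """f(A) = sum over all pool items i of max_{j in A} sim[i][j], with f(empty) = 0."""
    if not subset:
        return 0.0
    return sum(max(row[j] for j in subset) for row in sim)


def greedy_annotate(sim: Matrix, budget: int) -> List[int]:
    """Pick up to `budget` pool items, each time adding the item with the largest marginal gain."""
    n = len(sim)
    selected: List[int] = []
    for _ in range(min(budget, n)):
        base = facility_location(selected, sim)
        remaining = [j for j in range(n) if j not in selected]
        best = max(remaining,
                   key=lambda j: facility_location(selected + [j], sim) - base)
        selected.append(best)
    return selected


# Items 0 and 1 form one cluster, items 2 and 3 another (illustrative values).
SIM = [[1.0, 0.9, 0.1, 0.1],
       [0.9, 1.0, 0.1, 0.3],
       [0.1, 0.1, 1.0, 0.7],
       [0.1, 0.3, 0.7, 1.0]]
to_annotate = greedy_annotate(SIM, budget=2)
```

With a budget of two, the first pick is the most representative item of the first cluster and the second pick comes from the other cluster, since a near-duplicate of the first pick adds little coverage.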
Given a test query $Q$ and the labeled set of exemplars $\mathcal{L}$, the Exemplar Retrieval phase selects the top-$k$ in-context exemplars $\mathcal{C} \subseteq \mathcal{L}$ most relevant to answering $Q$. Although previous research (refer Sec. II) emphasizes modeling both quality and diversity among the selected exemplars, recent methods like [2, 6] focus mainly on query relevance, while [4, 5] employ multi-stage selection/summarization strategies in which each stage models either relevance or diversity. InSQuaD departs from existing works by introducing a unified single-stage formulation, eq. 3, which maximizes the SMI between $Q$ and the selected exemplars $\mathcal{C}$, modeling the retrieval of in-context exemplars as a targeted selection problem.
$$\mathcal{C}^{*} = \underset{\mathcal{C} \subseteq \mathcal{L},\; |\mathcal{C}| \leq k}{\arg\max}\; I_f(\mathcal{C}; Q) \tag{3}$$
The greedy optimization strategy [13] detailed in Alg. 1, adopted during the selection process, orders the selected exemplars by decreasing information gain. This results in an inherent ordering of the exemplars based on their relevance to answering the query, a necessity in ICL [51]. Note that the Exemplar Annotation phase is performed only once, whereas Exemplar Retrieval is invoked for each test query. Our results in Tab. II show that applying InSQuaD-RETRIEVE alone (indicated as NT) boosts performance over training-free ICL methods, showing the effectiveness of modeling all three notions of quality, diversity and order.
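Alg. 1 is not reproduced in this copy, so the following is only a generic greedy sketch of query-targeted selection. It assumes a facility-location function over a ground set that contains the labeled exemplars plus the query, and computes the SMI through f(C) + f(Q) - f(C ∪ Q). The similarity values and the name `greedy_retrieve` are illustrative assumptions.

```python
from typing import List, Sequence, Set

Matrix = Sequence[Sequence[float]]


def facility_location(subset: Set[int], sim: Matrix) -> float:
    """f(A) = sum over ground-set items i of max_{j in A} sim[i][j], with f(empty) = 0."""
    if not subset:
        return 0.0
    return sum(max(row[j] for j in subset) for row in sim)


def smi(subset: Set[int], query: Set[int], sim: Matrix) -> float:
    """I_f(C; Q) = f(C) + f(Q) - f(C u Q)."""
    return (facility_location(subset, sim) + facility_location(query, sim)
            - facility_location(subset | query, sim))


def greedy_retrieve(sim: Matrix, query: Set[int], pool: List[int], k: int) -> List[int]:
    """Return up to k exemplars, ordered by the marginal SMI gain at the time they were added."""
    selected: List[int] = []
    for _ in range(min(k, len(pool))):
        current = smi(set(selected), query, sim)
        best, best_gain = None, float("-inf")
        for j in pool:
            if j in selected:
                continue
            gain = smi(set(selected) | {j}, query, sim) - current
            if gain > best_gain:
                best, best_gain = j, gain
        selected.append(best)
    return selected


# Items 0 and 1 are near-duplicates that are both close to the query (item 3).
# Item 2 is less similar to the query but covers different content.
SIM = [[1.00, 0.95, 0.10, 0.90],
       [0.95, 1.00, 0.10, 0.85],
       [0.10, 0.10, 1.00, 0.60],
       [0.90, 0.85, 0.60, 1.00]]
exemplars = greedy_retrieve(SIM, query={3}, pool=[0, 1, 2], k=2)
```

In this toy case the most relevant item is picked first, and the second slot goes to the dissimilar item rather than the near-duplicate, because the duplicate adds no further shared information with the query. The returned order is the order of selection.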
III-C2 InSQuaD-LEARN
Although training-free strategies are largely popular in ICL for leveraging knowledge from in-context exemplars, [8] highlights a common shortcoming of pretrained models [14] used for retrieval: they model query relevance without incorporating the notion of diversity. Unlike existing approaches (refer Sec. II), InSQuaD explicitly builds the notions of quality and diversity into the retrieval model, mimicking the behavior expected during downstream ICL tasks.
InSQuaD achieves this by modeling the learning problem of the retrieval model as a Submodular Point Process (SPP) [46]. Given a subset of documents retrieved from a ground set and a query, SPPs estimate the probability of the retrieved set being relevant to answering the query, normalized over all possible subsets of the ground set, as shown in eq. 4. As in InSQuaD-RETRIEVE, we adopt SMI functions (with an underlying submodular function) to model the shared information between the retrieved subset and the query.
[TABLE]
To incorporate the SPPs defined above into a learning objective in InSQuaD-LEARN, given a predetermined set of relevant documents and a set of distractors, we define a likelihood ratio: the probability of selecting the relevant documents divided by the probability of selecting the distractors, given the parameters of the retrieval model.
[TABLE]
Our learning objective in InSQuaD-LEARN maximizes the likelihood that the query is similar to the relevant documents while minimizing its similarity to the distractors. Taking the negative logarithm of this ratio yields the learning objective shown in eq. 5, which is minimized during model training.
[TABLE]
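The step from the SPP probability to the loss can be written out as below. This is a reading of the surrounding text, not a copy of the paper's equations: the symbols are placeholders chosen here ($Q$ for the query, $\mathcal{R}$ for the relevant documents, $\mathcal{D}$ for the distractors, $\mathcal{V}$ for the ground set, $\theta$ for the retrieval-model parameters, $I_f$ for an SMI function), and the first line assumes the SPP probability is the SMI value normalized over all subsets.

```latex
\begin{align*}
P_\theta(\mathcal{A} \mid Q)
  &= \frac{I_f(\mathcal{A}; Q; \theta)}
          {\sum_{\mathcal{A}' \subseteq \mathcal{V}} I_f(\mathcal{A}'; Q; \theta)} \\
\frac{P_\theta(\mathcal{R} \mid Q)}{P_\theta(\mathcal{D} \mid Q)}
  &= \frac{I_f(\mathcal{R}; Q; \theta)}{I_f(\mathcal{D}; Q; \theta)} \\
-\log \frac{P_\theta(\mathcal{R} \mid Q)}{P_\theta(\mathcal{D} \mid Q)}
  &= \log I_f(\mathcal{D}; Q; \theta) - \log I_f(\mathcal{R}; Q; \theta)
\end{align*}
```

Under this reading, the sum over all subsets cancels in the ratio, so it never has to be evaluated. Minimizing the last line lowers the information shared between the query and the distractors and raises the information shared with the relevant documents.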
Given this SPP-inspired loss formulation, we introduce two novel formulations in the training loop of the retrieval model, one enforcing quality and the other enforcing diversity. Note that the loss depends on the SMI function, which in turn computes embedding similarities between the query and the relevant or distractor documents, and these similarities depend on the parameters of the retrieval model during training.
Training Loop: For every query in the training dataset (in practice, the mini-batch) posed to the retrieval model, we provide a set of exemplars relevant to answering the query and a set of distractors. Given the loss formulation in eq. 5, InSQuaD-LEARN enforces quality in the retrieval model through a quality loss (aggregated over the full dataset), which maximizes the feature overlap between the query and the relevant exemplars (Term 2 in eq. 6) while minimizing the common information between the query and the distractors (Term 1 in eq. 6).
[TABLE]
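A minimal numerical sketch of such a quality loss for a single query is given below. It makes several assumptions that the text does not confirm: the SMI instance is the graph-cut mutual information in its usual form (a scaled sum of query-document similarities), the similarity kernel is a cosine rescaled to [0, 1], embeddings are non-zero vectors, and `lam` and `eps` are illustrative constants. A real implementation would compute this on model embeddings inside an autograd framework.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine_kernel(u: Vector, v: Vector) -> float:
    """Cosine similarity rescaled to [0, 1] so that logarithms stay defined (an assumption)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 0.5 * (1.0 + dot / (norm_u * norm_v))


def graph_cut_mi(docs: List[Vector], query: Vector, lam: float = 0.5) -> float:
    """Graph-cut mutual information with a single query: 2 * lam * sum of similarities."""
    return 2.0 * lam * sum(cosine_kernel(d, query) for d in docs)


def quality_loss(query: Vector, relevant: List[Vector], distractors: List[Vector],
                 eps: float = 1e-8) -> float:
    """log I(distractors; query) - log I(relevant; query); lower is better."""
    term_distractors = math.log(graph_cut_mi(distractors, query) + eps)
    term_relevant = math.log(graph_cut_mi(relevant, query) + eps)
    return term_distractors - term_relevant


query = [1.0, 0.0]
aligned = [[1.0, 0.1]]      # close to the query
orthogonal = [[0.0, 1.0]]   # unrelated to the query

loss_good = quality_loss(query, relevant=aligned, distractors=orthogonal)
loss_bad = quality_loss(query, relevant=orthogonal, distractors=aligned)
```

The loss is negative when the relevant set sits closer to the query than the distractors do and positive in the swapped case, which is the direction of pressure the paragraph above describes.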
Although the quality loss trains the retrieval model to capture query relevance (the notion of quality), the model remains susceptible to paraphrases and irrelevant documents that convey a meaning similar to the relevant exemplars, degrading downstream performance. Given a set of paraphrases of the relevant documents, InSQuaD-LEARN reuses the loss formulation in eq. 5 to enforce a diversity-based objective that avoids retrieving such paraphrases when deployed in downstream ICL tasks. This loss formulation, shown in eq. 7, minimizes the SMI involving the paraphrases of relevant documents (Term 1 in eq. 7) while maximizing the information overlap between the true paraphrases and the relevant documents.
[TABLE]
Finally, InSQuaD combines the notions of quality and diversity into a joint loss formulation, as shown in Fig. 2. Here, a trade-off hyperparameter controls the balance between quality and diversity during training, with supporting experimental results in Sec. V. Note that by varying the choice of submodular function in both the quality and diversity losses among Graph-Cut (GC), Facility-Location (FL) and Log-Determinant (LD), we create three novel instances of the loss, as summarized in Tab. I. Our experiments in Tab. II combine InSQuaD-LEARN and InSQuaD-RETRIEVE into a unified formulation, showing significant performance boosts over baseline methods. Note that all instances depend on the computation of a similarity kernel to encode pairwise interactions between documents, as in [47].
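One form of the joint loss that would be consistent with the ablation in Sec. V, where setting the trade-off weight to zero leaves a quality-only objective, is an additive combination. The symbol names below are placeholders, and the exact form shown in Fig. 2 may differ:

```latex
L(\theta) = L_{\text{quality}}(\theta) + \lambda \, L_{\text{diversity}}(\theta),
\qquad \lambda \geq 0
```

Under this assumption, $\lambda = 0$ trains on relevance alone, and larger values put more weight on separating relevant documents from their paraphrases.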
Training Data Curation: The training loop in InSQuaD relies heavily on irrelevant documents acting as distractors, and on paraphrases alongside relevant documents, to bake the notions of quality and diversity into the training regime. Unfortunately, popular multi-hop Question Answering (QA) datasets [52, 17] do not contain paraphrases, so models trained on them cannot learn de-duplication, which hurts diversity (critical for ICL). To this end, given a standard QA dataset, we augment it with synthetically generated paraphrases to create our training data. Each row in the multi-hop QA dataset consists of a question, its supporting documents (split into relevant documents and distractors), and the answer. We synthetically generate one paraphrase for each relevant document and collect these into a paraphrase set. We leverage GPT-3.5 Turbo [53] to generate the paraphrases for each question. This follows the trend in existing works [54, 55], which generate synthetic questions/documents for auxiliary NLP tasks. Although any multi-hop QA dataset can be used here, our experiments adopt the HotpotQA [17] dataset owing to its popularity in multi-hop QA, reflected in its high citation volume (2349 citations) compared with contemporary multi-hop question answering datasets like MuSiQue [56] (207 citations) and ConcurrentQA [57] (22 citations). We will release the augmented dataset publicly upon review.
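The shape of an augmented training row can be sketched as follows. The class name, field names and the `paraphrase` callable are hypothetical, and the released dataset may use a different schema. The paraphraser is passed in as a plain function so that the sketch does not depend on any particular LLM API.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class MultiHopRow:
    """One multi-hop QA row; field names are illustrative, not the dataset's schema."""
    question: str
    relevant: List[str]
    distractors: List[str]
    answer: str
    paraphrases: List[str] = field(default_factory=list)


def augment(row: MultiHopRow, paraphrase: Callable[[str], str]) -> MultiHopRow:
    """Return a copy of the row with one paraphrase per relevant document."""
    return MultiHopRow(
        question=row.question,
        relevant=list(row.relevant),
        distractors=list(row.distractors),
        answer=row.answer,
        paraphrases=[paraphrase(doc) for doc in row.relevant],
    )


def dummy_paraphrase(doc: str) -> str:
    """Stand-in for an LLM call; it only marks the text so the example stays offline."""
    return "Paraphrase of: " + doc


row = MultiHopRow(
    question="Which city hosts the university where the author works?",
    relevant=["The author works at University X.", "University X is located in City Y."],
    distractors=["University Z is located in City W."],
    answer="City Y",
)
augmented = augment(row, dummy_paraphrase)
```

Keeping paraphrases aligned index by index with the relevant documents makes it straightforward to form the relevant, distractor and paraphrase sets that the two losses consume.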
IV Experiments
IV-A Datasets
We conduct our experiments on nine popular ICL benchmarks. The datasets employed in InSQuaD include MRPC [58], RTE [59], and MNLI [60] with binary or ternary labels (yes/no/maybe). SST5 [61] involves 5-way classification, DBPedia [62] features 14-way classification, and HellaSwag [63] requires 4-way classification. Additionally, the generative datasets MWoZ [64], GeoQ [65], and Xsum [66] are employed in our experiments.
IV-B Experimental Setup
We adopt an experimental setup similar to [6, 5], drawing inspiration from the MetaICL framework [67] for conducting all experiments. We use a varied set of LLMs as backbone models for downstream ICL tasks: Gemma (2B) [68], Gemma (7B) [68], and OpenAI-Davinci002 (175B) [69]. We adopt the SBERT [14] based retrieval model in InSQuaD-RETRIEVE and train it using the InSQuaD-LEARN formulation to evaluate improvements in downstream tasks.
We report the hyperparameters used in our experiments for reproducibility in Table III. We conduct three independent trials and report results with one standard deviation confidence intervals in Table II. All experiments use random data and weight initialization, resulting in variance even for non-trainable methods.
The model is trained for 7 epochs. The learning rate is set at $3 \times 10^{-5}$ (Table III), a value chosen to balance rapid convergence against training stability. We adopt a weight decay of 0.01 to mitigate overfitting by penalizing large weights. The learning rate follows a linear decay strategy, gradually reducing the learning rate, which is beneficial for fine-tuning the model in its later stages. The warmup ratio is set at 0.06, allowing for a gradual ramp-up of the learning rate at the beginning of training to prevent early divergence. The optimizer of choice is AdamW [70], selected for its effectiveness in handling sparse gradients and its adaptive learning rate capabilities. We estimate around 527 L4 GPU hours for all experiments. We use MPNET [71], which has 33,360,000 parameters. All experiments were conducted on a g2-standard-4 VM on GCP with 1 NVIDIA L4 GPU and 4 CPU cores.
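The schedule described above (linear warmup followed by linear decay) can be sketched in plain Python. The hyperparameter values come from Table III. The function name, the rounding of the warmup length and the choice of 1000 total steps are assumptions for illustration, since the number of optimizer steps is not stated.

```python
HPARAMS = {
    "epochs": 7,
    "batch_size": 32,
    "learning_rate": 3e-5,
    "weight_decay": 0.01,
    "warmup_ratio": 0.06,
}


def linear_schedule(step: int, total_steps: int,
                    base_lr: float = HPARAMS["learning_rate"],
                    warmup_ratio: float = HPARAMS["warmup_ratio"]) -> float:
    """Learning rate at a given optimizer step: linear ramp-up, then linear decay to zero."""
    warmup_steps = round(warmup_ratio * total_steps)
    if warmup_steps > 0 and step < warmup_steps:
        return base_lr * (step / warmup_steps)
    remaining = max(total_steps - step, 0)
    return base_lr * (remaining / max(total_steps - warmup_steps, 1))


# Illustrative run length; the real value depends on dataset size, batch size and epochs.
TOTAL_STEPS = 1000
peak_step = round(HPARAMS["warmup_ratio"] * TOTAL_STEPS)
lr_at_peak = linear_schedule(peak_step, TOTAL_STEPS)
```

With a warmup ratio of 0.06, the peak learning rate is reached after 6% of the steps and then decays linearly to zero at the final step.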
IV-C Results on Existing ICL Benchmarks
We contrast the performance of our proposed InSQuaD approach against several existing baselines tabulated in Tab.II. We contrast InSQuaD against Zero shot, Random selection, Vote-K [6], and IDEAL [5] among other baselines with the underlying LLM as Gemma (2B) [68] and the annotation budget B as 18 (with additional ablation experiments in Sec.V).
At first, we compare the performance of instances of InSQuaD-RETRIEVE by varying the choice of submodular function among Graph-Cut (GC), Facility-Location (FL) and Log-Determinant (LD) with pretrained model weights for the SBERT retrieval model without applying the training strategy in InSQuaD-LEARN. This is indicated as No Training (NT) in Tab.II. Our combinatorial formulation which models the exemplar selection task as a targeted selection problem shows improvements up to 6.1% (InSQuaD-FL (NT)) on the multi-choice (HellaSwag) benchmark, up to 21.3% (InSQuaD-GC (NT)) on the classification benchmark (MRPC), up to 8.6% (InSQuaD-LD (NT)) on the Dialogue (MWoZ) benchmark and up to 24.8% (InSQuaD-LD (NT)) on the generation benchmark (Xsum) over the latest baseline (IDEAL).
Secondly, we contrast the ICL performance of InSQuaD utilizing the SBERT-based retrieval model trained using the learning formulation of InSQuaD-LEARN on the novel augmented dataset discussed in Sec. III-C2. As in the earlier setting, we conduct experiments on three instances of InSQuaD-LEARN and InSQuaD-RETRIEVE by varying the choice of submodular function among popular choices like FL, GC and LD. Note that the choice of submodular function is consistent across the InSQuaD-LEARN and InSQuaD-RETRIEVE steps, i.e., if it is chosen to be FL, then we adopt Facility Location Mutual Information (FLMI) for targeted selection in InSQuaD-RETRIEVE as well as for computing the quality and diversity losses in InSQuaD-LEARN. Our results show improvements of up to 21.6% (InSQuaD-GC) on classification tasks, 16.4% (InSQuaD-GC) on multi-choice tasks, 4.8% (InSQuaD-LD) on the dialogue benchmark and up to 7% (InSQuaD-FL) on generation-based ICL tasks.
Although it is clear from Tab. II that fine-tuning the retrieval model with the InSQuaD-LEARN strategy performs significantly better than the no-training (NT) counterparts, we conduct further analysis to identify the most suitable instance of InSQuaD for ICL. For InSQuaD-GC, InSQuaD-FL and InSQuaD-LD, we report the average rank and average performance across all tasks (columns 11 and 12 in Tab. II). From these additional insights we observe that InSQuaD-GC (Graph-Cut based learning objective in InSQuaD-LEARN and Graph-Cut based selection in InSQuaD-RETRIEVE) is the best choice in practical settings, producing the best average performance and rank.
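The "Avg. Perf." column appears to be the unweighted mean of the nine per-task scores, rounded to two decimals. The sketch below checks that reading against four rows copied from Tab. II. The dictionary layout and function name are our own, and the average-rank column is not recomputed here because it depends on how ties are handled.

```python
from statistics import mean
from typing import Dict, List

# Per-task scores copied from Tab. II, in the column order
# MRPC, SST5, MNLI, DBpedia, RTE, HellaSwag, MWoZ, GeoQ, Xsum.
SCORES: Dict[str, List[float]] = {
    "Zeroshot":   [0.28, 0.26, 0.41, 0.52, 0.55, 0.19, 0.07, 0.63, 0.18],
    "IDEAL":      [0.47, 0.42, 0.35, 0.62, 0.54, 0.26, 0.36, 0.82, 0.19],
    "InSQuaD-GC": [0.58, 0.43, 0.44, 0.68, 0.57, 0.29, 0.40, 0.85, 0.23],
    "Oracle":     [0.68, 0.49, 0.64, 0.83, 0.71, 0.59, 0.53, 0.98, 0.31],
}


def average_performance(task_scores: List[float]) -> float:
    """Unweighted mean over tasks, rounded to two decimals as in the table."""
    return round(mean(task_scores), 2)


avg_perf = {name: average_performance(vals) for name, vals in SCORES.items()}
```

For these four rows the recomputed means match the reported 0.34, 0.45, 0.50 and 0.64.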
IV-D Comparing Inference Time
In stark contrast to the iterative selection methods of [6, 5], InSQuaD employs a combinatorial approach that slashes inference times (Fig.4). Meanwhile, the confidence-based selection in Vote-k and iterative influence maximization in IDEAL inflate computational costs above InSQuaD’s. By unifying the selection strategy with SMI functions, we significantly reduce inference times, with InSQuaD-GC being the swiftest. Differences among InSQuaD variants stem from kernel computations (InSQuaD-LD being the costliest), mirroring findings in earlier work [11].
V Ablations
Choice of retrieval method. We examine three retrieval strategies: Random, which assembles few-shots randomly after shortlisting; Similar, which selects the top-$k$ most similar samples; and InSQuaD-RETRIEVE. We fix the training strategy for the retrieval model as InSQuaD-LEARN (trained using the InSQuaD-FL formulation) and vary the methods used for Exemplar Annotation and Retrieval. As shown in Figure 3(b), our combinatorial formulation of InSQuaD-RETRIEVE consistently outperforms both baselines by jointly optimizing for quality and diversity in exemplar selection. This experiment highlights two key findings: (1) regardless of the retrieval method, training the SBERT model with our combinatorial InSQuaD-LEARN objective (varied among the popular submodular functions FL, GC and LD) consistently yields notable performance gains; and (2) the combination of InSQuaD-LEARN (for training the retrieval model) and targeted selection via InSQuaD-RETRIEVE achieves the highest gains in 7 out of 9 tasks.
Effect of annotation budget. We explore the effects of varying the annotation budget, using the same values as our baseline papers [5, 6] for a ceteris paribus analysis. As shown in Figure 3(c), accuracy does not scale with the annotation budget, and a larger budget often does not correlate strongly with performance gains. This observation is consistent with findings from our baseline papers [5, 6]. Note that while a larger budget increases the complexity of exemplar selection by expanding the search space, our approach maintains consistent performance regardless of budget size.
Effect of (quality-diversity trade-off). We explore the trade-off between quality and diversity in by adjusting the hyperparameter . When , the model focuses solely on query relevance (quality), ignoring paraphrase signals while, prioritizes both quality and diversity. Our results in Tab. IV clearly show the need for both quality and diversity in the learnt embeddings of with best performances achieved by models with in 7 out of 9 tasks. Nevertheless, it is also evident that the optimal value of varies with the downstream task, requiring the user to perform task-specific calibration during deployment.
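One way to read this trade-off is as a linear weighting of two training terms. The form below is a sketch under assumptions and not the paper's stated loss: the symbol $\lambda$ for the hyperparameter and the names $\mathcal{L}_{\text{quality}}$ and $\mathcal{L}_{\text{diversity}}$ are illustrative choices.

```latex
\mathcal{L}(\theta) \;=\; \mathcal{L}_{\text{quality}}(\theta) \;+\; \lambda \, \mathcal{L}_{\text{diversity}}(\theta), \qquad \lambda \ge 0
```

Under this reading, setting the weight to zero leaves only the quality term, so the retrieval model is trained on query relevance alone, while any positive weight lets the paraphrase-driven diversity signal shape the embeddings as well.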
Variation in model size. We investigate the impact of model scale by evaluating on three underlying LLMs: Gemma (2B), Gemma (7B) [68], and Davinci-002 (175B) [69]. Our results in Fig. 3(a), consistent with the findings of [5], show that larger models achieve better overall ICL performance. We nevertheless continue to use Gemma (2B) in our experiments (Tab. II) owing to its wide adoption in the latest benchmarks [5, 4] and its low parameter count. Adopting the methodology in InSQuaD yields improvements on all three model variants, indicating the generalizability of our approach.
VI Conclusion, Limitations and Future Work
InSQuaD introduces a novel combinatorial approach for In-Context Learning (ICL), leveraging SMI functions to enforce quality, diversity, and order in exemplar selection and retrieval. By pairing InSQuaD-RETRIEVE for targeted exemplar selection with InSQuaD-LEARN for training the underlying retrieval model through a likelihood-based combinatorial loss, our approach systematically improves ICL performance across nine benchmarks, validating our framework's effectiveness. Nevertheless, the current model uses only the HotpotQA dataset (chosen for its popularity in Question Answering (QA) literature) in InSQuaD-LEARN, leaving experiments with other multi-hop QA datasets to future research. Additionally, addressing selection biases during exemplar annotation and retrieval, and improving the interpretability of our model, are potential future research directions.
Acknowledgements
We gratefully thank anonymous reviewers for their valuable comments. We would also like to extend our gratitude to our fellow researchers from the CARAML lab at UT Dallas for their suggestions. This work is supported by the National Science Foundation under Grant Numbers IIS-2106937, a gift from Google Research, an Amazon Research Award, and the Adobe Data Science Research award. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation, Google or Adobe.
References
- [1] Yiming Zhang, Shi Feng and Chenhao Tan “Active Example Selection for In-Context Learning” In arXiv abs/2211.04486, 2022
- [2] Zhiyong Wu, Yaoxiang Wang, Jiacheng Ye and Lingpeng Kong “Self-Adaptive In-Context Learning: An Information Compression Perspective for In-Context Example Selection and Ordering” In ACL, 2022
- [3] Xiaonan Li and Xipeng Qiu “Finding Support Examples for In-Context Learning” In EMNLP, 2023
- [4] Lilly Kumari et al. “An End-to-End Submodular Framework for Data-Efficient In-Context Learning” In NAACL-HLT, 2024
- [5] Shaokun Zhang et al. “IDEAL: Influence-Driven Selective Annotations Empower In-Context Learners in Large Language Models” In arXiv abs/2310.10873, 2023
- [6] Hongjin Su et al. “Selective Annotation Makes Language Models Better Few-Shot Learners” In arXiv abs/2209.01975, 2022
- [7] Costas Mavromatis et al. “Which Examples to Annotate for In-Context Learning? Towards Effective and Efficient Selection” In arXiv abs/2310.20046, 2023
- [8] Costas Mavromatis et al. “Compositional Exemplars for In-context Learning” In ICML, 2023
