Modeling Data Diversity for Joint Instance and Verbalizer Selection in Cold-Start Scenarios

Mohna Chakraborty; Adithya Kulkarni; Qi Li

arXiv:2507.00330·cs.CL·July 2, 2025

Open Access

TL;DR

This paper introduces COLDSELECT, a method that models data diversity to improve joint instance and verbalizer selection in cold-start prompt-based learning, leading to better generalization without labeled data.

Contribution

COLDSELECT is the first approach to jointly model data diversity for instance and verbalizer selection, addressing the dependency between them in cold-start scenarios.

Findings

01. Outperforms baselines in reducing uncertainty

02. Enhances generalization in few-shot learning

03. Effective across eight benchmark datasets

Abstract

Prompt-based methods leverage the knowledge of pre-trained language models (PLMs) trained with a masked language modeling (MLM) objective; however, these methods are sensitive to template, verbalizer, and few-shot instance selection, particularly in cold-start settings with no labeled data. Existing studies overlook the dependency between instances and verbalizers, where instance-label probabilities depend on verbalizer token proximity in the embedding space. To address this, we propose COLDSELECT, a joint verbalizer and instance selection approach that models data diversity. COLDSELECT maps PLM vocabulary and $h_{[MASK]}$ embeddings into a shared space, applying dimensionality reduction and clustering to ensure efficient and diverse selection. By optimizing for minimal uncertainty and maximal diversity, COLDSELECT captures data relationships effectively. Experiments on eight benchmarks…
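The dependency described in the abstract, where an instance's label probabilities follow from how close its $h_{[MASK]}$ embedding lies to each verbalizer token embedding, can be illustrated with a small sketch. This is not the authors' implementation. It assumes dot-product proximity followed by a softmax, predictive entropy as the uncertainty measure, and cluster assignments that were already computed by some earlier dimensionality-reduction and clustering step. The function names (`label_probs`, `select_instances`) and the toy two-dimensional vectors are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def label_probs(h_mask, verbalizer_embs):
    """Assumed scoring rule: softmax over dot products between the
    [MASK] embedding and one verbalizer token embedding per label."""
    return softmax([dot(h_mask, e) for e in verbalizer_embs])


def entropy(probs):
    """Predictive entropy, used here as the uncertainty measure."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_instances(h_masks, cluster_ids, verbalizer_embs):
    """Keep the lowest-entropy instance from each cluster, so the
    selection is both confident and spread across clusters."""
    best = {}  # cluster id -> (entropy, instance index)
    for idx, (h, c) in enumerate(zip(h_masks, cluster_ids)):
        u = entropy(label_probs(h, verbalizer_embs))
        if c not in best or u < best[c][0]:
            best[c] = (u, idx)
    return sorted(idx for _, idx in best.values())


# Toy data (hypothetical): two labels, four unlabeled instances,
# and cluster ids assumed to come from a prior clustering step.
verbalizer_embs = [[1.0, 0.0], [0.0, 1.0]]
h_masks = [[3.0, 0.0], [0.5, 0.5], [0.0, 2.0], [0.2, 0.1]]
cluster_ids = [0, 0, 1, 1]

selected = select_instances(h_masks, cluster_ids, verbalizer_embs)
print(selected)
```

Changing the verbalizer embeddings changes every instance's entropy, and therefore which instances are selected. That coupling is the reason for choosing instances and verbalizers jointly rather than one after the other.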

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Artificial Intelligence in Healthcare and Education