Instruction-following Evaluation through Verbalizer Manipulation
Shiyang Li, Jun Yan, Hai Wang, Zheng Tang, Xiang Ren, Vijay Srinivasan, Hongxia Jin

TL;DR
This paper introduces verbalizer manipulation as a new evaluation method to assess instruction-following capabilities of language models by varying how task labels are verbalized, revealing models' reliance on priors and their ability to override them.
Contribution
It proposes a novel evaluation protocol that tests models' instruction-following by manipulating verbalizers, providing deeper insights into their reliance on priors and ability to follow instructions.
Findings
Models' performance varies significantly with different verbalizers.
Even GPT-4 struggles with less natural verbalizers, performing near random chance.
The evaluation highlights the need for improved instruction-following abilities.
Abstract
While instruction-tuned models have shown remarkable success in various natural language processing tasks, accurately evaluating their ability to follow instructions remains challenging. Existing benchmarks primarily focus on common instructions that align well with what the model learned during training. However, proficiency in responding to these instructions does not necessarily imply strong ability in instruction following. In this paper, we propose a novel instruction-following evaluation protocol called verbalizer manipulation. It instructs the model to verbalize the task label with words aligning with model priors to different extents, adopting verbalizers from highly aligned (e.g., outputting "positive" for positive sentiment), to minimally aligned (e.g., outputting "negative" for positive sentiment). Verbalizer manipulation can be seamlessly integrated with any…
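The protocol described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the prompt template, the function names, the placeholder words "foo" and "bar" for an intermediate (neutral) setting, and the toy model are all assumptions made for this example. Only the idea of swapping "positive" and "negative" comes from the abstract.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical verbalizer sets for binary sentiment classification.
# Only the "positive"/"negative" swap is taken from the abstract; the
# neutral words "foo"/"bar" are an illustrative assumption.
VERBALIZERS: Dict[str, Dict[str, str]] = {
    "natural": {"positive": "positive", "negative": "negative"},
    "neutral": {"positive": "foo", "negative": "bar"},
    "unnatural": {"positive": "negative", "negative": "positive"},
}


def build_prompt(text: str, verbalizer: Dict[str, str]) -> str:
    """Build an instruction telling the model which word to emit per label.

    The template is an assumption, not the paper's actual prompt.
    """
    rules = " ".join(
        f'If the sentiment is {label}, output "{word}".'
        for label, word in verbalizer.items()
    )
    return f"Review: {text}\n{rules}\nAnswer:"


def decode(output: str, verbalizer: Dict[str, str]) -> Optional[str]:
    """Map the emitted word back to a task label; None if unrecognized."""
    inverse = {word: label for label, word in verbalizer.items()}
    return inverse.get(output.strip().lower())


def accuracy(
    model: Callable[[str], str],
    examples: List[Tuple[str, str]],
    verbalizer: Dict[str, str],
) -> float:
    """Fraction of examples whose decoded output matches the gold label."""
    if not examples:
        return 0.0
    correct = sum(
        decode(model(build_prompt(text, verbalizer)), verbalizer) == gold
        for text, gold in examples
    )
    return correct / len(examples)


def prior_only_model(prompt: str) -> str:
    """Toy stand-in for a model that ignores the instruction entirely and
    always answers with the natural sentiment word (its 'prior')."""
    review = prompt.split("\n", 1)[0].lower()
    return "positive" if "great" in review else "negative"


EXAMPLES = [
    ("A great film.", "positive"),
    ("A dull, terrible film.", "negative"),
]

if __name__ == "__main__":
    for name, verbalizer in VERBALIZERS.items():
        print(name, accuracy(prior_only_model, EXAMPLES, verbalizer))
```

With this toy model, accuracy is perfect under the natural verbalizer and drops to zero under the other two, which is the kind of gap the protocol is meant to expose: a model that relies only on its priors looks competent until the instruction conflicts with them.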
Peer Reviews
Decision: ICLR 2024 Conference Withdrawn Submission
1. AFAIK, the paper is the first one studying the Stroop effect in LLMs. As LLMs gain stronger and stronger language-related cognitive capabilities, it is valuable to conduct these kinds of experiments to see whether LLMs show behavior similar to humans. 2. The authors conducted thorough experiments to show different aspects of the performance in different incongruent tasks. 3. The work reminds us that LLMs still have many issues in some counter-intuitive tasks and this might raise p
1. The novelty might be a concern. The Stroop effect/test and its variations are well known in the field of experimental psychology. The experimental details are not novel given its well-established history. The difference in this work is replacing human participants with LLMs. While ICLR is a conference mostly related to computer science, the contribution may still be considered non-incremental given its specific focus and scope. I raise my concern here but am not sure about the answer; will
This paper provides an evaluation dataset for instruction-tuned models to measure the degree to which the models follow instructions. Based on this dataset, it conducts various experiments to verify the behavior of the instruction-tuned models under different verbalizer manipulations.
This work seems to be very similar to [1]. Even the proposed three types of verbalizer manipulations have similar representations in both papers. The only apparent difference is that this work is about zero-shot learning, whereas [1] focuses on in-context learning. The evaluation dataset and models being compared are slightly different, but I feel that this work has a weak contribution, lacking novelty. If this were the first paper to raise the importance of evaluating instruction-following capa
- The paper evaluates various instruction-tuned models including FLAN-T5, GPT-series, Vicuna, and OPT-IML, enabling a comprehensive analysis. - The presentation of the paper is clear and easy to follow.
- Many of the paper's evaluation datasets are included in the instruction-tuning data of FLAN-T5 and OPT-IML. The U-shaped and inverse-scaling trends in Figure 2 for unnatural instructions might arise because the models were trained on 'natural' instructions for the evaluation tasks (i.e., fitted to the natural instructions during training). In this sense, evaluation on unseen datasets should also be conducted for FLAN-T5 and OPT-IML. - The observations and the evaluation setti
Taxonomy
Topics: Topic Modeling · Ferroelectric and Negative Capacitance Devices · Machine Learning and Algorithms
Methods: Attention Is All You Need · Layer Normalization · Label Smoothing · Linear Layer · Multi-Head Attention · Softmax · Dense Connections · Dropout · Byte Pair Encoding · Residual Connection
