Benchmarking Large Language Models on Controllable Generation under Diversified Instructions
Yihan Chen, Benfeng Xu, Quan Wang, Yi Liu, Zhendong Mao

TL;DR
This paper introduces CoDI-Eval, a benchmark for evaluating how well large language models follow diverse instructions that carry explicit constraints. It highlights the models' current limitations and the performance gap between open-source and closed-source LLMs.
Contribution
It presents a new benchmark with a diversified instruction set and automated evaluation to systematically assess LLMs' controllability in instruction-following tasks.
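To make "automated evaluation" of controllability concrete, here is a minimal sketch of rule-based constraint checking. It is an illustration only, not CoDI-Eval's actual evaluation code: the two constraint types (a word limit and a required keyword), the function names, and the sample responses are all hypothetical assumptions.

```python
import re
from typing import Callable, List

# Hypothetical constraint checkers. The names and rules are illustrative
# assumptions, not the benchmark's real evaluators.
def max_words(limit: int) -> Callable[[str], bool]:
    """Return a checker that passes when the response has at most `limit` words."""
    return lambda response: len(response.split()) <= limit

def contains_keyword(keyword: str) -> Callable[[str], bool]:
    """Return a checker that passes when `keyword` occurs as a whole word (case-insensitive)."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    return lambda response: pattern.search(response) is not None

def accuracy(responses: List[str], checkers: List[Callable[[str], bool]]) -> float:
    """Fraction of responses that satisfy their paired constraint checker."""
    if not responses:
        return 0.0
    passed = sum(1 for r, check in zip(responses, checkers) if check(r))
    return passed / len(responses)

# Made-up model responses, each paired with the constraint its instruction imposed.
responses = [
    "The weather is sunny today.",       # instruction asked for at most 5 words
    "I enjoy reading about astronomy.",  # instruction asked to include "ocean"
]
checkers = [max_words(5), contains_keyword("ocean")]
score = accuracy(responses, checkers)
```

Pairing each response with its own checker keeps the scoring loop independent of the constraint type, so new constraint categories can be added without changing how accuracy is computed.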
Findings
LLMs show limitations in adhering to specific constraints.
A significant performance gap exists between open-source and closed-source LLMs.
The benchmark facilitates future research on improving controllability.
Abstract
While large language models (LLMs) have exhibited impressive instruction-following capabilities, it is still unclear whether and to what extent they can respond to explicit constraints that might be entailed in various instructions. As a significant aspect of LLM alignment, it is thus important to formulate such a specialized set of instructions as well as investigate the resulting behavior of LLMs. To address this vacancy, we propose a new benchmark CoDI-Eval to systematically and comprehensively evaluate LLMs' responses to instructions with various constraints. We construct a large collection of constraints-attributed instructions as a test suite focused on both generalization and coverage. Specifically, we advocate an instruction diversification process to synthesize diverse forms of constraint expression and also deliberate the candidate task taxonomy with even finer-grained…
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification
Methods: Sparse Evolutionary Training
