Benchmarking Large Language Models on Controllable Generation under Diversified Instructions

Yihan Chen; Benfeng Xu; Quan Wang; Yi Liu; Zhendong Mao

arXiv:2401.00690 · cs.CL · January 2, 2024 · 2 cites

TL;DR

This paper introduces CoDI-Eval, a comprehensive benchmark for evaluating large language models' ability to follow diverse, constrained instructions, highlighting current limitations and gaps between different models.

Contribution

It presents a new benchmark with a diversified instruction set and automated evaluation to systematically assess LLMs' controllability in instruction-following tasks.

Findings

1. LLMs show limitations in adhering to specific constraints.

2. A significant performance gap exists between open-source and closed-source LLMs.

3. The benchmark facilitates future research on improving controllability.

Abstract

While large language models (LLMs) have exhibited impressive instruction-following capabilities, it is still unclear whether and to what extent they can respond to explicit constraints that might be entailed in various instructions. As a significant aspect of LLM alignment, it is thus important to formulate such a specialized set of instructions as well as investigate the resulting behavior of LLMs. To address this vacancy, we propose a new benchmark CoDI-Eval to systematically and comprehensively evaluate LLMs' responses to instructions with various constraints. We construct a large collection of constraints-attributed instructions as a test suite focused on both generalization and coverage. Specifically, we advocate an instruction diversification process to synthesize diverse forms of constraint expression and also deliberate the candidate task taxonomy with even finer-grained…
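To make the idea of automatically checking responses against explicit constraints concrete, here is a minimal sketch in Python. It is not the CoDI-Eval implementation: the checker names (`word_count_between`, `contains_keywords`), the `constraint_accuracy` helper, and the two constraint types shown are hypothetical and chosen only for illustration. The sketch assumes each response is paired with one programmatically verifiable constraint.

```python
from typing import Callable, List

# Hypothetical type alias: a checker maps a model response to pass/fail.
Checker = Callable[[str], bool]


def word_count_between(lo: int, hi: int) -> Checker:
    """Build a checker that passes when the response has lo..hi whitespace-separated words."""
    def check(response: str) -> bool:
        return lo <= len(response.split()) <= hi
    return check


def contains_keywords(keywords: List[str]) -> Checker:
    """Build a checker that passes when every keyword appears (case-insensitive substring match)."""
    def check(response: str) -> bool:
        text = response.lower()
        return all(k.lower() in text for k in keywords)
    return check


def constraint_accuracy(responses: List[str], checkers: List[Checker]) -> float:
    """Fraction of responses that satisfy their paired constraint checker."""
    if len(responses) != len(checkers):
        raise ValueError("each response needs exactly one checker")
    if not responses:
        return 0.0
    passed = sum(1 for r, c in zip(responses, checkers) if c(r))
    return passed / len(responses)


if __name__ == "__main__":
    responses = ["The cat sat down", "hello there"]
    checkers = [word_count_between(3, 5), contains_keywords(["world"])]
    print(constraint_accuracy(responses, checkers))
```

Under these assumptions, a model's controllability on a constraint type reduces to the share of its responses that pass the corresponding checker. Constraints that are not rule-checkable (for example sentiment or topic) would need a classifier or another scoring method in place of these simple functions.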

Code & Models

Repositories

xt-cyh/codi-eval (PyTorch, official)

Videos

Benchmarking Large Language Models on Controllable Generation under Diversified Instructions · Underline

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification

Methods: Sparse Evolutionary Training