Revisiting Compositional Generalization Capability of Large Language Models Considering Instruction Following Ability
Yusuke Sakai, Hidetaka Kamigaito, Taro Watanabe

TL;DR
This paper introduces Ordered CommonGen, a benchmark to evaluate large language models' ability to follow instructions and generalize compositionally, revealing current models' limitations in ordered concept generation.
Contribution
The paper proposes a new benchmark for assessing instruction-following and compositional generalization in LLMs, providing a comprehensive analysis of 36 models' performance.
Findings
LLMs often produce low-diversity outputs biased toward specific concept orders.
Even the best models achieve only about 75% ordered coverage.
Current models need improvements in instruction-following and compositional generalization.
Abstract
In generative commonsense reasoning tasks such as CommonGen, generative large language models (LLMs) compose sentences that include all given concepts. However, when focusing on instruction-following capabilities, if a prompt specifies a concept order, LLMs must generate sentences that adhere to the specified order. To address this, we propose Ordered CommonGen, a benchmark designed to evaluate the compositional generalization and instruction-following abilities of LLMs. This benchmark measures ordered coverage to assess whether concepts are generated in the specified order, enabling a simultaneous evaluation of both abilities. We conducted a comprehensive analysis using 36 LLMs and found that, while LLMs generally understand the intent of instructions, biases toward specific concept order patterns often lead to low-diversity outputs or identical results even when the concept order is…
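The idea of ordered coverage described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `covers_in_order` and `ordered_coverage_rate` are hypothetical, matching is done on exact lowercase tokens (a real evaluation would likely need lemmatization so that "dogs" matches "dog"), and each sentence is scored as a simple pass or fail.

```python
import re


def covers_in_order(sentence: str, concepts: list[str]) -> bool:
    """Return True if every concept occurs in the sentence in the given order.

    Assumption: exact token match on lowercase alphabetic tokens, no lemmatization.
    """
    tokens = re.findall(r"[a-z]+", sentence.lower())
    pos = 0  # search for each concept only after the previous match
    for concept in concepts:
        try:
            pos = tokens.index(concept.lower(), pos) + 1
        except ValueError:
            return False  # concept missing, or only present out of order
    return True


def ordered_coverage_rate(pairs: list[tuple[str, list[str]]]) -> float:
    """Fraction of (sentence, concept order) pairs that satisfy the order."""
    if not pairs:
        return 0.0
    hits = sum(covers_in_order(sentence, concepts) for sentence, concepts in pairs)
    return hits / len(pairs)


sentence = "The dog catches the frisbee in the park."
pairs = [
    (sentence, ["dog", "frisbee", "park"]),  # order respected
    (sentence, ["park", "dog", "frisbee"]),  # same sentence, different requested order
]
rate = ordered_coverage_rate(pairs)
```

In this sketch the second pair fails because the sentence contains all three concepts but not in the requested order, which is the case that plain (unordered) coverage would not detect.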
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
