Revisiting Compositional Generalization Capability of Large Language Models Considering Instruction Following Ability

Yusuke Sakai; Hidetaka Kamigaito; Taro Watanabe

arXiv:2506.15629 · cs.CL · June 19, 2025

TL;DR

This paper introduces Ordered CommonGen, a benchmark to evaluate large language models' ability to follow instructions and generalize compositionally, revealing current models' limitations in ordered concept generation.

Contribution

The paper proposes a new benchmark for assessing instruction-following and compositional generalization in LLMs, providing a comprehensive analysis of 36 models' performance.

Findings

1. LLMs often produce low-diversity outputs biased toward specific concept orders.
2. Even the best models achieve only about 75% ordered coverage.
3. Current models need improvements in instruction-following and compositional generalization.

Abstract

In generative commonsense reasoning tasks such as CommonGen, generative large language models (LLMs) compose sentences that include all given concepts. However, when focusing on instruction-following capabilities, if a prompt specifies a concept order, LLMs must generate sentences that adhere to the specified order. To address this, we propose Ordered CommonGen, a benchmark designed to evaluate the compositional generalization and instruction-following abilities of LLMs. This benchmark measures ordered coverage to assess whether concepts are generated in the specified order, enabling a simultaneous evaluation of both abilities. We conducted a comprehensive analysis using 36 LLMs and found that, while LLMs generally understand the intent of instructions, biases toward specific concept order patterns often lead to low-diversity outputs or identical results even when the concept order is…
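The ordered coverage idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`tokenize`, `covers_all`, `covers_in_order`, `ordered_coverage`) are hypothetical, and the matching rule is an assumption (exact lowercase word match, with no lemmatization, so an inflected form such as "catches" would not count as "catch"). A sentence counts as ordered only if the concepts appear as a subsequence of its words in the specified order.

```python
import re


def tokenize(sentence):
    """Lowercase the sentence and keep alphabetic words only.

    Assumption: a simple stand-in for whatever normalization
    (e.g. lemmatization) a real evaluation would apply.
    """
    return re.findall(r"[a-z]+", sentence.lower())


def covers_all(concepts, sentence):
    """Plain coverage: every concept appears somewhere in the sentence."""
    tokens = set(tokenize(sentence))
    return all(concept.lower() in tokens for concept in concepts)


def covers_in_order(concepts, sentence):
    """Ordered coverage for one sentence.

    True if the concepts occur as a subsequence of the tokens, in the
    given order. The shared iterator is consumed left to right, so each
    concept must be found after the previous one.
    """
    tokens = iter(tokenize(sentence))
    return all(
        any(token == concept.lower() for token in tokens)
        for concept in concepts
    )


def ordered_coverage(examples):
    """Fraction of (concepts, sentence) pairs that satisfy the order."""
    if not examples:
        return 0.0
    hits = sum(covers_in_order(concepts, sentence) for concepts, sentence in examples)
    return hits / len(examples)


sentence = "The dog sees a frisbee and runs to catch it."
examples = [
    (["dog", "frisbee", "catch"], sentence),  # order respected
    (["catch", "dog", "frisbee"], sentence),  # all concepts present, order violated
]
score = ordered_coverage(examples)
```

In this toy input both concept lists are fully covered by the sentence, but only the first matches the requested order, which is the gap between plain coverage and ordered coverage that the benchmark is designed to expose.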

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling