Complex Logical Instruction Generation

Mian Zhang; Shujian Liu; Sixun Dong; Ming Yin; Yebowen Hu; Xun Wang; Steven Ma; Song Wang; Sathish Reddy Indurthi; Haoyun Deng; Zhiyu Zoey Chen; Kaiqiang Song

arXiv:2508.09125·cs.CL·January 28, 2026

TL;DR

This paper introduces LogicIFGen and LogicIFEval, a framework and benchmark for evaluating LLMs on complex, logic-rich instructions derived from code functions, revealing current models' limitations in following such instructions.

Contribution

The paper presents a novel automated method to generate and evaluate complex logic-based instructions for LLMs, highlighting their current performance gaps.

Findings

1. Most evaluated LLMs correctly follow fewer than 60% of the instructions.

2. LogicIFEval contains 426 complex, verifiable instructions.

3. Current models struggle with logic-rich instructions.

Abstract

Instruction following has catalyzed the recent era of Large Language Models (LLMs) and is the foundational skill underpinning more advanced capabilities such as reasoning and agentic behaviors. As tasks grow more challenging, the logic structures embedded in natural language instructions become increasingly intricate. However, how well LLMs perform on such logic-rich instructions remains under-explored. We propose LogicIFGen and LogicIFEval. LogicIFGen is a scalable, automated framework for generating verifiable instructions from code functions, which can naturally express rich logic such as conditions, loops, and function calls. We further curate a collection of complex code functions and use LogicIFGen to construct LogicIFEval, a benchmark comprising 426 verifiable logic-rich instructions. Our experiments demonstrate that current state-of-the-art LLMs still struggle to correctly…
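To make "verifiable instructions from code functions" concrete, here is a minimal Python sketch of the general idea. It is not the authors' pipeline: the seed function `count_sign_changes`, the hand-written `INSTRUCTION` text, the stand-in `careless_responder`, and the helper `follow_rate` are all hypothetical names introduced for illustration. The assumption is that a natural-language instruction mirrors a function's logic, so a response can be checked by running the function on the same input.

```python
from typing import Callable, List

def count_sign_changes(values: List[int]) -> int:
    """Hypothetical seed function with a loop and two conditions."""
    changes = 0
    previous = 0  # most recent non-zero value seen so far (0 = none yet)
    for v in values:
        if v == 0:
            continue  # zeros are skipped entirely
        if previous != 0 and (v > 0) != (previous > 0):
            changes += 1  # sign flipped relative to the last non-zero value
        previous = v
    return changes

# Hand-written natural-language counterpart of the function above
# (an assumption; in the paper this step is automated).
INSTRUCTION = (
    "Go through the numbers in order. Skip every zero. Each time a non-zero "
    "number has the opposite sign from the most recent non-zero number "
    "before it, add one to a counter. Report the counter."
)

def careless_responder(values: List[int]) -> int:
    """Stand-in for a model that ignores the 'skip every zero' rule."""
    changes = 0
    for a, b in zip(values, values[1:]):
        if (a > 0) != (b > 0):
            changes += 1
    return changes

def follow_rate(responder: Callable[[List[int]], int],
                cases: List[List[int]]) -> float:
    """Fraction of cases where the response matches the executed function."""
    correct = sum(responder(c) == count_sign_changes(c) for c in cases)
    return correct / len(cases)

# Toy inputs chosen for illustration only; they are not benchmark data.
CASES = [[3, -1, 0, -2, 5], [1, 0, 2], [1, -1], []]
```

In this toy setup, `follow_rate(careless_responder, CASES)` scores a responder against the executed function. A real evaluation would replace the stand-in with parsed model outputs for the prompt built from `INSTRUCTION`.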

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 6 · Confidence 4

Strengths

- The paper studies an interesting problem: building complex natural language instructions from code and testing a model's instruction-following ability with these code-generated instructions.
- The paper's main pipeline for building these instructions is interesting and solid. The paper also did a decent job of collecting the coding problems, which could be a contribution to the community.

Weaknesses

- The naming is very confusing. Fundamentally, LogicIFGen is the framework and LogicIFEval is the benchmark derived from it, but the naming makes this misleading.
- I think the analysis part is not solid enough. For example, some simple reasoning baselines such as Program-of-Thought should be tested and analyzed. Since the instructions are derived from code, I think the paper should in general consider the aspect of reasoning with code.

Reviewer 02 · Rating 2 · Confidence 3

Strengths

Overall, the paper is easy to follow. The experiments appear to be fully documented and reproducible, and the topic of assessing LLM abilities in relation to the logical complexity of the task is highly relevant.

Weaknesses

1. The paper's main weakness is the limited relevance and coherence of the analysis conducted. Although assessing the abilities of LLMs in instruction-following tasks by investigating their dependency on increasing levels of logical complexity is promising, the main results focus on comparing the performance of different models. The overall pattern of declining performance with increasing instruction complexity is only briefly discussed. However, the subsequent analysis of different failure modes does not

Reviewer 03 · Rating 4 · Confidence 4

Strengths

1. The proposed framework, LogicIFGen, is scalable and verifiable. The authors also conduct human studies to show that the generation is of high quality.
2. The research question itself is interesting and important: LLMs have to be able to follow various types of instructions, which potentially involve complex logic relations, to properly serve users.
3. Writing and presentation are very clear. It is very easy to understand the authors' points.

Weaknesses

1. Related work section is not comprehensive enough. It reviews many works which are relevant but not closely relevant to the research question in the general LLM reasoning area. More strongly related works should be thoroughly discussed as I will elaborate below.
2. The novelty of the research question is not extremely clear given many existing works in highly similar areas. This is my biggest concern. I think there are many existing works in code execution which are highly relevant, but not f

Code & Models

Datasets

billmianz/LogicIFEval
dataset · 70 downloads

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques