RLIE: Rule Generation with Logistic Regression, Iterative Refinement, and Evaluation for Large Language Models

Yang Yang; Hua XU; Zhangyi Hu; Yutao Yue

arXiv:2510.19698 · cs.AI · February 16, 2026

TL;DR

RLIE is a framework that combines large language models with probabilistic rule learning, involving rule generation, weight learning, iterative refinement, and evaluation, to improve neuro-symbolic reasoning accuracy.

Contribution

RLIE introduces a novel integrated approach that couples LLM-generated rules with probabilistic modeling and iterative refinement for enhanced reasoning.

Findings

01

Directly applying weighted rules improves performance.

02

Prompting LLMs with rules and weights can reduce accuracy.

03

LLMs are better at semantic tasks than at probabilistic integration.

Abstract

Large Language Models (LLMs) can propose rules in natural language, sidestepping the need for a predefined predicate space in traditional rule learning. Yet many LLM-based approaches ignore interactions among rules, and the opportunity to couple LLMs with probabilistic rule learning for robust inference remains underexplored. We present RLIE, a unified framework that integrates LLMs with probabilistic modeling to learn a set of weighted rules. RLIE has four stages: (1) Rule generation, where an LLM proposes and filters candidates; (2) Logistic regression, which learns probabilistic weights for global selection and calibration; (3) Iterative refinement, which updates the rule set using prediction errors; and (4) Evaluation, which compares the weighted rule set as a direct classifier with methods that inject rules into an LLM. We evaluate multiple inference strategies on real-world…
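Stage (2) can be illustrated with a small sketch. The code below is not the authors' implementation: it assumes each rule has already been judged on each sample as +1 (supports the positive label), -1 (opposes it), or 0 (does not apply), and it fits logistic regression weights over those ternary features by plain gradient descent. The L1 penalty, the function names `fit_rule_weights` and `predict_proba`, the hyperparameters, and the toy data are all assumptions made for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_rule_weights(X, y, l1=0.01, lr=0.1, steps=2000):
    """Fit one weight per rule plus a bias with L1-regularized logistic regression.

    X: list of rows, each a list of rule judgments in {-1, 0, +1} (assumed encoding).
    y: list of 0/1 labels.
    The L1 penalty is an assumption here, chosen so unhelpful rules shrink to zero.
    """
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(steps):
        # Prediction error for every sample under the current weights.
        errs = [
            sigmoid(sum(wj * xj for wj, xj in zip(w, row)) + b) - yi
            for row, yi in zip(X, y)
        ]
        for j in range(d):
            grad = sum(e * row[j] for e, row in zip(errs, X)) / n
            stepped = w[j] - lr * grad
            # Soft-thresholding: the proximal step for the L1 penalty.
            w[j] = math.copysign(max(abs(stepped) - lr * l1, 0.0), stepped)
        b -= lr * sum(errs) / n
    return w, b


def predict_proba(row, w, b):
    """Probability of the positive label for one row of rule judgments."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, row)) + b)


# Toy data (hypothetical): rule 0 tracks the label, rule 1 is uninformative,
# rule 2 never applies.
X = [[1, 1, 0], [1, -1, 0], [-1, 1, 0], [-1, -1, 0], [1, 0, 0], [-1, 0, 0]]
y = [1, 1, 0, 0, 1, 0]
w, b = fit_rule_weights(X, y)
```

On this toy data the informative rule receives a positive weight while the other two stay at zero, which is the sense in which a weighted rule set can act as a direct classifier and drop redundant rules.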

Peer Reviews

Decision · Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 5

Strengths

The idea of combining LLM-based semantic rule generation with a probabilistic model for global reasoning is conceptually appealing.

Weaknesses

1. Although the paper presents a well-organized framework that integrates LLM-based rule generation with probabilistic weighting via logistic regression and iterative refinement, the overall idea is conceptually incremental. The notion of combining LLM-generated symbolic rules with classical probabilistic or statistical models has already appeared in several recent neuro-symbolic or rule-learning works.
2. The experiments rely solely on GPT-4o-mini with a near-deterministic decoding setting. It

Reviewer 02 · Rating 2 · Confidence 3

Strengths

- The paper addresses an important and challenging problem: integrating the semantic capabilities of LLMs with more structured, probabilistic reasoning frameworks.
- The proposed iterative refinement loop, where the LLM is prompted to revise rules based on model errors, is an interesting idea for automated feature engineering.
- The work explores different ways of combining learned rules with LLMs for inference, leading to an interesting (though negative) result about the difficulty of fine-grai

Weaknesses

- The central concept of a "rule" is ill-defined and misleading. What the paper calls "rules in natural language" are effectively just natural language prompts or questions posed to an LLM to generate ternary features (+1, 0, -1). These "rules" lack the formal structure, interpretability, and compositionality of rules in traditional symbolic systems.
- The experimental comparison is flawed. The paper compares the performance of a trained logistic regression model against a prompted LLM that is g

Reviewer 03 · Rating 2 · Confidence 5

Strengths

1. The paper is well-structured and readable.
2. The methodology of using logistic regression to learn rule weights is sound and shows a certain degree of novelty compared with traditional top-K methods.

Weaknesses

1. Insufficient baselines for experimental comparison. The authors use approximately 400 labeled samples in their experiments but do not compare against methods such as few-shot in-context learning (ICL) or fine-tuning neural networks, as discussed in studies like "What Makes Good In-Context Examples for GPT-3?" and "In-Context Learning Learns Label Relationships but Is Not Conventional Learning." Moreover, the authors do not assess how the proposed method scales with varying model capac

Taxonomy

Topics: Topic Modeling · Computational and Text Analysis Methods · Multimodal Machine Learning Applications