ENTP: Enhancing Low-Quality SFT Data via Neural-Symbolic Text Purge-Mix

Zile Yang; Ling Li; Na Di; Jinlong Pang; Yao Zhou; Hao Cheng; Bo Han; Jiaheng Wei

arXiv:2510.23160·cs.CL·October 28, 2025

TL;DR

ENTP is a framework that revitalizes low-quality instruction-response data using symbolic purification and neural synthesis, leading to improved model performance even with less curated data.

Contribution

The paper introduces ENTP, a novel neural-symbolic method to enhance low-quality datasets for better instruction tuning of language models.

Findings

01

ENTP-augmented data outperforms 13 baselines across five benchmarks.

02

Models fine-tuned on ENTP data surpass those trained on full original datasets.

03

Neural-symbolic approach effectively leverages low-quality data for instruction alignment.

Abstract

Supervised Fine-Tuning (SFT) adapts pre-trained Large Language Models (LLMs) to domain-specific instructions by training on a carefully curated subset of high-quality instruction-response pairs, typically drawn from a larger dataset that often contains many low-quality or noisy samples. However, existing quality-first paradigms often overlook valuable signals in discarded low-quality data and rely on imperfect quality filters. We introduce ENTP (Enhancing low-quality SFT data via Neural-symbolic Text Purge-Mix), a framework that revitalizes low-quality corpora through symbolic purification and neural reconstruction. The symbolic module identifies and prunes noisy samples based on statistical priors, while the neural component synthesizes enriched instruction-response pairs by leveraging latent representations and model knowledge. This neural-symbolic synergy enhances data…

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01·Rating 4·Confidence 3

Strengths

The paper is original in reframing “low-quality” instruction data as valuable signal via a neural–symbolic *purge-mix* pipeline: it corrects noisy LLM quality scores with a score-transition matrix to address rater inconsistency, then fuses representative low-quality samples into richer training pairs using rule-guided and LLM-driven steps, rather than discarding them. This tackles two known caveats of quality-first filtering and is clearly motivated in the introduction.
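The score-correction idea the reviewer mentions can be illustrated with a generic label-noise sketch. This is not the paper's procedure, which is not given in this excerpt: it assumes a known transition matrix `TRANSITION[i][j] = P(observed score j | true score i)` and a prior over true scores, and applies Bayes' rule. All names and numbers below are hypothetical.

```python
# Generic Bayes correction of a noisy rater score (illustrative assumption,
# not the ENTP implementation).

def posterior_true_score(observed, transition, prior):
    """Return P(true = i | observed) for each score level i."""
    unnormalized = [prior[i] * transition[i][observed] for i in range(len(prior))]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def correct_score(observed, transition, prior):
    """Replace an observed score with the most probable true score."""
    posterior = posterior_true_score(observed, transition, prior)
    return max(range(len(posterior)), key=posterior.__getitem__)


# Hypothetical 3-level score scale (0 = low, 2 = high); rows sum to 1.
TRANSITION = [
    [0.7, 0.2, 0.1],
    [0.3, 0.5, 0.2],
    [0.1, 0.2, 0.7],
]
PRIOR = [0.6, 0.2, 0.2]

corrected = [correct_score(s, TRANSITION, PRIOR) for s in (0, 1, 2)]
```

With these made-up numbers, an observed middle score is mapped back to the lowest level because low-quality samples dominate the prior, which is the kind of rater inconsistency a transition matrix is meant to absorb.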

Weaknesses

The method feels overengineered and difficult to follow end-to-end, which raises the barrier to adoption. A concrete, running example would help: start from two real “low-quality” instruction–response pairs, show how scores are corrected and items clustered, present the exact symbolic-loss/ICD artifact produced, and then the final fused pair. Empirically, the benchmark coverage is too narrow: adding harder math benchmarks (AIME ’24/’25, MATH500) plus coding tasks, and IFEval and SimpleQ

Reviewer 02·Rating 4·Confidence 4

Strengths

The paper challenges the dominant "quality-first" paradigm by demonstrating that low-quality data should be enhanced rather than discarded, addressing the critical problem that high-quality instruction data has been largely exhausted. This approach achieves better results with 54K synthetic samples derived from low-quality data than training on the full 300K original dataset, providing a practical path forward as the LLM field faces a data scarcity bottleneck. The experimental design tests thre

Weaknesses

The paper uses confusing and misleading terminology throughout, particularly the term "backpropagation", which suggests gradient-based optimization but actually refers to simple rule-based switching between prompt templates. The "neural-symbolic" framing overstates the sophistication of what is essentially an if-else logic system that selects from 9 hand-written prompt templates based on structured error detection. The writing is dense and difficult to follow, with critical implementation details buried i
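To make the reviewer's characterization concrete, rule-based template switching of this kind might look roughly like the sketch below. The error categories, template wording, and the shape of the error report are all hypothetical stand-ins; the paper's 9 templates are not reproduced in this excerpt.

```python
# Hypothetical error types mapped to hand-written prompt templates
# (illustrative only; not the templates used in the paper).
TEMPLATES = {
    "missing_context": (
        "Rewrite the instruction so it includes the context needed to answer it:\n"
        "{instruction}"
    ),
    "incomplete_response": (
        "Complete the response so it fully answers the instruction:\n"
        "{instruction}\n{response}"
    ),
    "format_error": (
        "Fix the formatting of the response without changing its content:\n"
        "{response}"
    ),
}
DEFAULT_TEMPLATE = (
    "Improve the following instruction-response pair:\n{instruction}\n{response}"
)


def select_template(error_report):
    """Pick the template for the first recognized error, else the default."""
    for error_type in error_report.get("errors", []):
        if error_type in TEMPLATES:
            return TEMPLATES[error_type]
    return DEFAULT_TEMPLATE


def build_prompt(pair, error_report):
    """Fill the selected template with the instruction-response pair."""
    return select_template(error_report).format(**pair)


pair = {"instruction": "Sum 2 and 3", "response": "5"}
prompt = build_prompt(pair, {"errors": ["format_error"]})
```

Nothing here involves gradients: the "feedback" is a structured error report, and the update is a dictionary lookup, which is the distinction the reviewer is drawing.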

Reviewer 03·Rating 2·Confidence 4

Strengths

1. Although the proposed pipeline lacks genuine algorithmic novelty, it appears robustly engineered and well-implemented, showing careful system design and integration.
2. Addresses a timely and relevant problem: the efficient utilization and enhancement of low-quality SFT data, which remains a critical bottleneck in instruction tuning.
3. Provides comprehensive empirical comparisons across multiple benchmarks and base models, offering a reasonably broad empirical validation despite limited meth

Weaknesses

1. The paper lacks genuine novelty while introducing excessive and unnecessary conceptual packaging. In essence, the method is merely a {clustering + prototype selection + data augmentation} pipeline. The first two steps mainly reassemble engineering tricks already well-documented in prior works, offering no new algorithmic contribution. The third step is essentially an overcomplicated extension of self-refine [1] or critic-LLM [2] frameworks, repackaged under heavy “neural-symbolic fusion” term
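For readers unfamiliar with the first two steps the reviewer names, a minimal sketch of clustering followed by prototype selection is given below. It assumes samples are already embedded as numeric vectors and uses a tiny deterministic k-means; the function names, the initialization, and the toy data are illustrative assumptions, not the paper's implementation.

```python
import math


def kmeans(points, k, iters=10):
    """Plain Lloyd iterations, initialized from the first k points (assumption)."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign every point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(x) / len(members) for x in zip(*members)]
    return labels, centroids


def select_prototypes(points, labels, centroids):
    """Per cluster, return the index of the sample closest to the centroid."""
    prototypes = {}
    for c, centroid in enumerate(centroids):
        members = [i for i, label in enumerate(labels) if label == c]
        if members:
            prototypes[c] = min(
                members, key=lambda i: math.dist(points[i], centroid)
            )
    return prototypes


# Toy 2-D "embeddings" forming two well-separated groups.
points = [(0, 0), (10, 10), (0, 1), (1, 0), (10, 11), (11, 10)]
labels, centroids = kmeans(points, k=2)
prototypes = select_prototypes(points, labels, centroids)
```

The selected prototypes would then be handed to the augmentation step; how ENTP actually fuses them is not described in this excerpt.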
