LPDS: Evaluating LLM Robustness Through Logic-Preserving Difficulty Scaling

Philipp Mondorf; Samuel J. Bell; Jesse Dodge; Dieuwke Hupkes

arXiv:2605.15393 · cs.LG · May 18, 2026

TL;DR

This paper introduces LPDS, a framework for systematically evaluating LLM robustness by identifying and testing the most challenging logic-preserving variations, revealing significant performance drops and guiding more effective fine-tuning.

Contribution

LPDS provides a systematic method to quantify and find the most difficult problem variations, improving robustness evaluation and training strategies for LLMs.

Findings

01. Performance drops are up to 5 times larger with LPDS than with random sampling.

02. LPDS efficiently finds difficult variations that induce model failures.

03. Fine-tuning on difficult variations yields more consistent robustness improvements.

Abstract

As large language models (LLMs) are increasingly deployed to perform tasks with minimal human oversight, it is crucial that these models operate robustly. In particular, a model that can solve a given problem should not fail simply because certain entities – such as names, numbers, or other contextual details – have changed while the underlying problem logic remains the same. Prior work suggests that current LLMs still struggle with this form of robustness: they often succeed on some variations of a problem but fail on others. However, existing evaluations often lack a systematic way to identify which logic-preserving variations are most likely to induce failure. Instead, they typically test a random subset of allowable variations, which can overstate robustness. To address this gap, we introduce logic-preserving difficulty scaling (LPDS), a framework that…
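To make the abstract's point about random subsets concrete, here is a minimal toy sketch. It is not the LPDS procedure, which the truncated abstract does not describe. The template, the entity pools, the `toy_model` stand-in for an LLM, and the "largest sum" difficulty proxy are all assumptions made for illustration. It enumerates logic-preserving variations of one arithmetic problem, then compares accuracy on a random subset with accuracy on the variations ranked hardest by the proxy.

```python
import itertools
import random

# Hypothetical template: the logic (a + b) is fixed; only the entities vary.
TEMPLATE = "{name} has {a} apples and buys {b} more. How many apples does {name} have?"
NAMES = ["Ava", "Bo", "Chen", "Dara"]
NUMBERS = [2, 7, 13, 48, 96]


def make_variations():
    """Enumerate every logic-preserving variation with its ground-truth answer."""
    return [
        {"text": TEMPLATE.format(name=n, a=a, b=b), "a": a, "b": b, "answer": a + b}
        for n, a, b in itertools.product(NAMES, NUMBERS, NUMBERS)
    ]


def toy_model(variation):
    """Stand-in for an LLM (assumption): answers wrongly when the sum reaches 100."""
    total = variation["a"] + variation["b"]
    return total if total < 100 else total - 100  # simulated failure mode


def accuracy(model, variations):
    """Fraction of variations the model answers correctly."""
    return sum(model(v) == v["answer"] for v in variations) / len(variations)


variations = make_variations()

# Average-case view: a fixed-seed random subset of the allowable variations.
rng = random.Random(0)
random_subset = rng.sample(variations, 10)

# Worst-case view: the 10 variations ranked hardest by an assumed proxy (largest sum).
hard_subset = sorted(variations, key=lambda v: v["a"] + v["b"], reverse=True)[:10]

full_acc = accuracy(toy_model, variations)
random_acc = accuracy(toy_model, random_subset)
hard_acc = accuracy(toy_model, hard_subset)

print(f"all variations: {full_acc:.2f}")
print(f"random subset:  {random_acc:.2f}")
print(f"hard subset:    {hard_acc:.2f}")
```

In this toy setting the model fails on every variation in the hard subset, while a random subset mostly samples variations it handles, so the random estimate can look far better than the worst case. A real evaluation would replace `toy_model` with calls to an actual model and would need a difficulty signal that does not rely on knowing the failure mode in advance.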
