Don't Judge a Book by its Cover: Testing LLMs' Robustness Under Logical Obfuscation

Abhilekh Borah; Shubhra Ghosh; Kedar Joshi; Aditya Kumar Guru; Kripabandhu Ghosh

arXiv:2602.01132·cs.CL·February 3, 2026



TL;DR

This paper introduces Logifus, a structure-preserving logical obfuscation framework, and LogiQAte, a diagnostic benchmark built with it. Current LLMs perform markedly worse when problems are posed in logically equivalent but obfuscated form, which suggests their grasp of the underlying logic is superficial.

Contribution

The paper contributes two tools for evaluating LLM robustness to logical obfuscation, the Logifus framework and the LogiQAte benchmark, and uses them to expose weaknesses in current models' reasoning.

Findings

01. Obfuscation reduces GPT-4o performance by 47%.

02. Performance drops by 27% for GPT-5.

03. Reasoning models' accuracy decreases by 22%.

Abstract

Tasks such as solving arithmetic equations, evaluating truth tables, and completing syllogisms are handled well by large language models (LLMs) in their standard form, but they often fail when the same problems are posed in logically equivalent yet obfuscated formats. To study this vulnerability, we introduce Logifus, a structure-preserving logical obfuscation framework, and, utilizing this, we present LogiQAte, a first-of-its-kind diagnostic benchmark with 1,108 questions across four reasoning tasks: (i) Obfus FOL (first-order logic entailment under equivalence-preserving rewrites), (ii) Obfus Blood Relation (family-graph entailment under indirect relational chains), (iii) Obfus Number Series (pattern induction under symbolic substitutions), and (iv) Obfus Direction Sense (navigation reasoning under altered directions and reference frames). Across all the tasks, evaluating six…
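To make "logically equivalent yet obfuscated" concrete, the following is a minimal propositional sketch, not the authors' Logifus implementation. It assumes a toy premise of the form (P and Q) -> R and rewrites it with the contrapositive, De Morgan's law, and a double negation; the function names `original`, `obfuscated`, `converse`, and `equivalent` are hypothetical and introduced only for illustration. A truth-table check confirms that the rewrite preserves meaning, while a superficially similar formula (the converse) does not.

```python
from itertools import product


def implies(a: bool, b: bool) -> bool:
    """Material implication: a -> b."""
    return (not a) or b


def original(p: bool, q: bool, r: bool) -> bool:
    # Standard form: (P and Q) -> R
    return implies(p and q, r)


def obfuscated(p: bool, q: bool, r: bool) -> bool:
    # Hypothetical obfuscation: contrapositive + De Morgan + double negation,
    # i.e. not not (not R -> (not P or not Q))
    return not (not implies(not r, (not p) or (not q)))


def converse(p: bool, q: bool, r: bool) -> bool:
    # Looks similar but is NOT equivalent: R -> (P and Q)
    return implies(r, p and q)


def equivalent(f, g, arity: int) -> bool:
    """Compare two Boolean functions on every row of the truth table."""
    return all(
        f(*row) == g(*row)
        for row in product([False, True], repeat=arity)
    )


if __name__ == "__main__":
    print(equivalent(original, obfuscated, 3))  # True
    print(equivalent(original, converse, 3))    # False
```

A model that reasons over the logical structure should answer questions about `original` and `obfuscated` identically, since the two agree on all eight truth assignments.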


Code & Models

Datasets

abhilekhborah/LogiQAte (dataset · 12 downloads)

Videos

Don’t Judge a Book by its Cover: Testing LLMs’ Robustness Under Logical Obfuscation · Underline

Taxonomy

Topics: Topic Modeling · Benford’s Law and Fraud Detection · Advanced Graph Neural Networks