BIS Reasoning 1.0: The First Large-Scale Japanese Benchmark for Belief-Inconsistent Syllogistic Reasoning

Ha-Thanh Nguyen; Hideyuki Tachibana; Chaoran Liu; Qianying Liu; Su Myat Noe; Koichi Takeda; Sadao Kurohashi

arXiv:2506.06955 · cs.CL · March 17, 2026

Open Access

TL;DR

This paper introduces BIS Reasoning 1.0, a large-scale Japanese benchmark for evaluating belief-inconsistent syllogistic reasoning in large language models. It tests whether a model follows logical validity when a valid conclusion conflicts with common beliefs, and shows where current models resist or succumb to belief bias.

Contribution

It provides the first large-scale Japanese dataset for belief-inconsistent reasoning and benchmarks multiple LLMs, highlighting the importance of explicit reasoning optimization for robustness.

Findings

01. Reasoning-centric models achieve near-perfect accuracy (~99%) on the benchmark.

02. GPT-4o attains around 80% accuracy, while earlier Japanese-specialized models perform below 60%.

03. Performance depends on prompt design and reasoning effort, especially when logic conflicts with beliefs.

Abstract

We present BIS Reasoning 1.0, the first large-scale Japanese dataset of syllogistic reasoning problems explicitly designed to evaluate belief-inconsistent reasoning in large language models (LLMs). Unlike prior resources such as NeuBAROCO and JFLD, which emphasize general or belief-aligned logic, BIS Reasoning 1.0 systematically introduces logically valid yet belief-inconsistent syllogisms to expose belief bias, the tendency to accept believable conclusions irrespective of validity. We benchmark a representative suite of cutting-edge models, including OpenAI GPT-5 variants, GPT-4o, Qwen, and prominent Japanese LLMs, under a uniform, zero-shot protocol. Reasoning-centric models achieve near-perfect accuracy on BIS Reasoning 1.0 (e.g., Qwen3-32B $\approx$ 99% and GPT-5-mini up to $\approx$ 99.7%), while GPT-4o attains around 80%. Earlier Japanese-specialized models underperform, often well…
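The distinction between validity and believability that the abstract relies on can be illustrated with a small sketch. It is not the authors' evaluation code: the term labels, the tuple encoding of statements, and the two example items are assumptions made for illustration and are not drawn from the dataset. The checker uses modern (no existential import) semantics and decides validity purely from logical form, by enumerating every combination of inhabited "types" over three terms.

```python
from itertools import product

TERMS = ("A", "B", "C")
# A "type" records which of the three terms an individual belongs to.
# The truth of a categorical statement depends only on which types are inhabited.
TYPES = list(product([False, True], repeat=len(TERMS)))


def holds(statement, inhabited):
    """Evaluate a (quantifier, subject, predicate) statement in one model."""
    quantifier, subj, pred = statement
    s, p = TERMS.index(subj), TERMS.index(pred)
    if quantifier == "all":
        return all(t[p] for t in inhabited if t[s])
    if quantifier == "no":
        return not any(t[s] and t[p] for t in inhabited)
    if quantifier == "some":
        return any(t[s] and t[p] for t in inhabited)
    raise ValueError(f"unknown quantifier: {quantifier}")


def is_valid(premises, conclusion):
    """Valid iff no model makes all premises true and the conclusion false."""
    for mask in product([False, True], repeat=len(TYPES)):
        inhabited = [t for t, keep in zip(TYPES, mask) if keep]
        premises_true = all(holds(p, inhabited) for p in premises)
        if premises_true and not holds(conclusion, inhabited):
            return False  # counterexample found
    return True


# Hypothetical item 1 (A = cats, B = birds, C = things that fly):
# "All cats are birds. All birds fly. Therefore all cats fly."
# The conclusion is unbelievable, but the form is valid.
valid_but_unbelievable = is_valid(
    [("all", "A", "B"), ("all", "B", "C")], ("all", "A", "C")
)

# Hypothetical item 2 (A = cats, B = animals, C = mammals):
# "All cats are animals. All mammals are animals. Therefore all cats are mammals."
# The conclusion is believable, but the form is invalid.
believable_but_invalid = is_valid(
    [("all", "A", "B"), ("all", "C", "B")], ("all", "A", "C")
)

print(valid_but_unbelievable, believable_but_invalid)
```

A model exhibiting belief bias would tend to reject the first item and accept the second, which is the reverse of what the form-only checker reports.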

Taxonomy

Topics: Topic Modeling · Explainable Artificial Intelligence (XAI) · Artificial Intelligence in Healthcare and Education