Omanic: Towards Step-wise Evaluation of Multi-hop Reasoning in Large Language Models

Xiaojie Gu; Sherry T. Tong; Aosong Feng; Sophia Simeng Han; Jinghui Lu; Yingjian Chen; Yusuke Iwasawa; Yutaka Matsuo; Chanjun Park; Rex Ying; and Irene Li

arXiv:2603.16654·cs.CL·March 18, 2026

TL;DR

Omanic introduces a multi-hop QA dataset with step-by-step reasoning annotations for evaluating and improving the reasoning of large language models. It exposes current limitations, and fine-tuning on it transfers to other reasoning benchmarks.

Contribution

The paper presents Omanic, a new dataset with structural reasoning annotations, and demonstrates its effectiveness for evaluating and enhancing reasoning in large language models.

Findings

01

State-of-the-art LLMs achieve only 73.11% multiple-choice accuracy on OmanicBench.

02

Fine-tuning on OmanicSynth improves performance across reasoning benchmarks.

03

Stepwise analysis highlights the roles of factual completeness and error propagation in multi-hop reasoning.

Abstract

Reasoning-focused large language models (LLMs) have advanced in many NLP tasks, yet their evaluation remains challenging: final answers alone do not expose the intermediate reasoning steps, making it difficult to determine whether a model truly reasons correctly and where failures occur, while existing multi-hop QA benchmarks lack step-level annotations for diagnosing reasoning failures. To address this gap, we propose Omanic, an open-domain multi-hop QA resource that provides decomposed sub-questions and intermediate answers as structural annotations for analyzing reasoning processes. It contains 10,296 machine-generated training examples (OmanicSynth) and 967 expert-reviewed human-annotated evaluation examples (OmanicBench). Systematic evaluations show that state-of-the-art LLMs achieve only 73.11% multiple-choice accuracy on OmanicBench, confirming its high difficulty. Stepwise…
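The abstract describes each example as a question decomposed into sub-questions with intermediate answers. As a rough illustration of why that structure helps diagnosis, the sketch below scores a prediction hop by hop. The class names, field names, exact-match scoring rule, and the toy question are all assumptions made for illustration; they are not the actual Omanic schema, data, or evaluation protocol.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Union


@dataclass
class Hop:
    """One reasoning step: a sub-question and its gold intermediate answer."""
    sub_question: str
    gold_answer: str


@dataclass
class MultiHopExample:
    """Hypothetical record layout; the real dataset fields may differ."""
    question: str
    hops: List[Hop]
    final_answer: str


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not count."""
    return " ".join(text.lower().split())


def stepwise_report(
    example: MultiHopExample,
    predicted_hops: List[str],
    predicted_final: str,
) -> Dict[str, Union[List[bool], Optional[int], bool]]:
    """Compare predictions with gold answers at every hop (exact match).

    Returns per-hop correctness, the 0-based index of the first wrong hop
    (None if every hop is right), and whether the final answer is right.
    """
    if len(predicted_hops) != len(example.hops):
        raise ValueError("expected one predicted answer per hop")
    hop_correct = [
        normalize(pred) == normalize(hop.gold_answer)
        for pred, hop in zip(predicted_hops, example.hops)
    ]
    first_error = next((i for i, ok in enumerate(hop_correct) if not ok), None)
    return {
        "hop_correct": hop_correct,
        "first_error_hop": first_error,
        "final_correct": normalize(predicted_final) == normalize(example.final_answer),
    }


# Toy example written for this sketch; it is NOT taken from Omanic.
toy = MultiHopExample(
    question="In which country was the author of 'Norwegian Wood' born?",
    hops=[
        Hop("Who wrote 'Norwegian Wood'?", "Haruki Murakami"),
        Hop("In which city was Haruki Murakami born?", "Kyoto"),
        Hop("In which country is Kyoto?", "Japan"),
    ],
    final_answer="Japan",
)

# The model gets the final answer right despite a wrong intermediate step.
report = stepwise_report(toy, ["haruki murakami", "Tokyo", "Japan"], "Japan")
print(report)
```

In this toy case the final answer is correct even though the second hop is wrong, which is the kind of failure that final-answer accuracy alone cannot reveal.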

Code & Models

Datasets

li-lab/Omanic
dataset · 72 dl

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Natural Language Processing Techniques