Omanic: Towards Step-wise Evaluation of Multi-hop Reasoning in Large Language Models
Xiaojie Gu, Sherry T. Tong, Aosong Feng, Sophia Simeng Han, Jinghui Lu, Yingjian Chen, Yusuke Iwasawa, Yutaka Matsuo, Chanjun Park, Rex Ying, and Irene Li

TL;DR
Omanic is a multi-hop QA dataset with step-by-step reasoning annotations for evaluating and improving the reasoning of large language models. It exposes limitations of current models, and fine-tuning on its training split transfers to other reasoning benchmarks.
Contribution
The paper presents Omanic, a new dataset with structural reasoning annotations, and demonstrates its effectiveness for evaluating and enhancing reasoning in large language models.
Findings
State-of-the-art LLMs achieve only 73.11% multiple-choice accuracy on OmanicBench.
Fine-tuning on OmanicSynth improves performance across reasoning benchmarks.
Stepwise analysis highlights the roles of factual completeness and error propagation in multi-hop reasoning.
Abstract
Reasoning-focused large language models (LLMs) have advanced in many NLP tasks, yet their evaluation remains challenging: final answers alone do not expose the intermediate reasoning steps, making it difficult to determine whether a model truly reasons correctly and where failures occur, while existing multi-hop QA benchmarks lack step-level annotations for diagnosing reasoning failures. To address this gap, we propose Omanic, an open-domain multi-hop QA resource that provides decomposed sub-questions and intermediate answers as structural annotations for analyzing reasoning processes. It contains 10,296 machine-generated training examples (OmanicSynth) and 967 expert-reviewed human-annotated evaluation examples (OmanicBench). Systematic evaluations show that state-of-the-art LLMs achieve only 73.11% multiple-choice accuracy on OmanicBench, confirming its high difficulty. Stepwise…
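As a rough illustration of what step-level annotations make possible, the sketch below models a multi-hop example as an ordered list of sub-questions with intermediate answers, then finds the first hop where a model's prediction diverges. The class names, field names, and the toy question are all hypothetical and are not taken from the Omanic schema or data; exact-match scoring after light normalization is likewise an assumption, not the paper's metric.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Hop:
    # Hypothetical fields: one decomposed sub-question and its gold intermediate answer.
    sub_question: str
    gold_answer: str


@dataclass
class MultiHopExample:
    question: str
    hops: List[Hop]  # ordered decomposition of the question
    final_answer: str


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences do not count as errors."""
    return " ".join(text.lower().split())


def first_failed_hop(example: MultiHopExample, predicted: List[str]) -> Optional[int]:
    """Return the 0-based index of the first hop answered incorrectly, or None if all hops match."""
    for i, (hop, pred) in enumerate(zip(example.hops, predicted)):
        if normalize(hop.gold_answer) != normalize(pred):
            return i
    # A missing prediction counts as a failure at the first unanswered hop.
    if len(predicted) < len(example.hops):
        return len(predicted)
    return None


def step_accuracy(example: MultiHopExample, predicted: List[str]) -> float:
    """Fraction of hops whose predicted intermediate answer matches the gold answer."""
    correct = sum(
        normalize(hop.gold_answer) == normalize(pred)
        for hop, pred in zip(example.hops, predicted)
    )
    return correct / len(example.hops)


# Toy example written for this sketch; it is not drawn from OmanicBench or OmanicSynth.
example = MultiHopExample(
    question="In which country was the author of the novel 'Norwegian Wood' born?",
    hops=[
        Hop("Who wrote the novel 'Norwegian Wood'?", "Haruki Murakami"),
        Hop("In which city was Haruki Murakami born?", "Kyoto"),
        Hop("In which country is that city located?", "Japan"),
    ],
    final_answer="Japan",
)

# A model that names the wrong city but still lands on the right country.
predicted = ["Haruki Murakami", "Tokyo", "Japan"]

print(first_failed_hop(example, predicted))
print(step_accuracy(example, predicted))
```

In this toy case the final answer is correct even though the second hop is wrong, which is the kind of failure that scoring final answers alone cannot reveal.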
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Natural Language Processing Techniques
