ARCHE: A Novel Task to Evaluate LLMs on Latent Reasoning Chain Extraction

Pengze Li; Jiaqi Liu; Junchi Yu; Lihao Liu; Mingyu Ding; Wanli Ouyang; Shixiang Tang; Xi Chen

arXiv:2511.12485 · cs.AI · November 18, 2025

TL;DR

This paper introduces ARCHE, a new task and benchmark for evaluating large language models' ability to explicitly extract and categorize reasoning steps into standard paradigms, revealing current models' limitations in scientific reasoning.

Contribution

The paper presents ARCHE, a novel task and benchmark for explicit reasoning chain extraction, along with logic-aware metrics and an evaluation of 10 leading LLMs.

Findings

01. Models show a trade-off between content coverage and logical validity.

02. No current model can fully extract complete, standard reasoning chains.

03. A significant gap exists between current LLM reasoning abilities and scientific inference requirements.

Abstract

Large language models (LLMs) are increasingly used in scientific domains. While they can produce reasoning-like content via methods such as chain-of-thought prompting, these outputs are typically unstructured and informal, obscuring whether models truly understand the fundamental reasoning paradigms that underpin scientific inference. To address this, we introduce a novel task named Latent Reasoning Chain Extraction (ARCHE), in which models must decompose complex reasoning arguments into combinations of standard reasoning paradigms in the form of a Reasoning Logic Tree (RLT). In RLT, all reasoning steps are explicitly categorized as one of three variants of Peirce's fundamental inference modes: deduction, induction, or abduction. To facilitate this task, we release ARCHE Bench, a new benchmark derived from 70 Nature Communications articles, including more than 1,900 references and…
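
To make the Reasoning Logic Tree idea concrete, here is a minimal sketch of how such a tree might be represented, assuming each non-leaf step carries exactly one of the three inference modes named in the abstract. The class names, field names, and the toy example are hypothetical illustrations, not the paper's actual schema or data.

```python
from __future__ import annotations

from collections import Counter
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class InferenceMode(Enum):
    """The three inference modes named in the abstract."""
    DEDUCTION = "deduction"
    INDUCTION = "induction"
    ABDUCTION = "abduction"


@dataclass
class RLTNode:
    """One node of a hypothetical Reasoning Logic Tree.

    A node with premises is a reasoning step and must carry a mode.
    A node without premises is a leaf (a given fact) and has no mode.
    """
    statement: str
    mode: Optional[InferenceMode] = None
    premises: list[RLTNode] = field(default_factory=list)


def count_modes(node: RLTNode) -> Counter:
    """Count how often each inference mode is used in the tree."""
    counts: Counter = Counter()
    if node.premises:
        if node.mode is None:
            raise ValueError(f"step without a mode: {node.statement!r}")
        counts[node.mode] += 1
    for premise in node.premises:
        counts += count_modes(premise)
    return counts


# Toy example (invented for illustration, not taken from ARCHE Bench).
example_tree = RLTNode(
    statement="Hypothesis H best explains effect E",
    mode=InferenceMode.ABDUCTION,
    premises=[
        RLTNode(
            statement="Effect E occurs across samples",
            mode=InferenceMode.INDUCTION,
            premises=[
                RLTNode("Sample 1 shows E"),
                RLTNode("Sample 2 shows E"),
            ],
        ),
        RLTNode("If H held, E would be expected"),
    ],
)

mode_counts = count_modes(example_tree)
```

In this sketch the mode labels the step that derives a node from its premises, so leaves stay unlabeled. Whether the benchmark attaches the label to nodes or to edges is not stated in the text above, so treat that choice as an assumption.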

Taxonomy

Topics: Topic Modeling · Biomedical Text Mining and Ontologies · Advanced Graph Neural Networks