Semi-structured LLM Reasoners Can Be Rigorously Audited

Jixuan Leng; Cassandra A. Cohen; Zhixian Zhang; Chenyan Xiong; William W. Cohen

arXiv:2505.24217 · cs.CL · September 30, 2025

Open Access · 3 Reviews

TL;DR

This paper introduces Semi-Structured Reasoning Models (SSRMs) that produce audit-friendly reasoning traces, enabling automatic detection of errors without sacrificing model accuracy across multiple benchmarks.

Contribution

The paper presents SSRMs that generate semi-structured reasoning traces, facilitating rigorous auditing methods to detect reasoning errors in large language models.

Findings

01. SSRMs produce audit-friendly reasoning traces.

02. Auditing methods effectively identify reasoning errors.

03. SSRMs maintain high accuracy across benchmarks.

Abstract

Although Large Language Models (LLMs) have become capable reasoners, the problem of faithfulness persists: their reasoning can contain errors and omissions that are difficult to detect and that may obscure biases in model outputs. To address this issue, we introduce Semi-Structured Reasoning Models (SSRMs), which are trained to produce semi-structured representations of reasoning. SSRMs generate reasoning traces in a non-executable Pythonic syntax that names each reasoning step and marks its inputs and outputs. This structure allows SSRM traces to be automatically audited to identify reasoning flaws. We evaluate three types of audits: hand-crafted structured reasoning audits, written in a domain-specific language (DSL) implemented in Python; LLM-generated structured reasoning audits; and learned typicality audits, which apply probabilistic models over reasoning traces. We show that all…
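To make the idea of an auditable trace concrete, the following is a minimal Python sketch. Everything in it is an assumption made for illustration: the one-line-per-step format `name(inputs) -> output`, the step names, the sample values, and the audit rule are hypothetical and are not taken from the paper, whose actual trace syntax and audit DSL are not described on this page. It shows only the general pattern: parse a trace into named steps with explicit inputs and outputs, then check a structural property over those steps.

```python
import re
from dataclasses import dataclass

# Hypothetical trace format (an assumption, not the paper's syntax):
# one step per line, written as  name(inputs) -> output
TRACE = """\
extract_weight(note) -> 70
extract_height(note) -> 1.75
compute_bmi(70, 1.75) -> 22.86
"""

STEP_RE = re.compile(r"^(?P<name>\w+)\((?P<inputs>.*)\)\s*->\s*(?P<output>.+)$")


@dataclass
class Step:
    name: str      # name of the reasoning step
    inputs: list   # argument strings passed to the step
    output: str    # value the step claims to return


def parse_trace(text):
    """Parse the hypothetical trace format into a list of Step records."""
    steps = []
    for line in text.strip().splitlines():
        m = STEP_RE.match(line.strip())
        if m is None:
            raise ValueError(f"unparseable step: {line!r}")
        raw = m.group("inputs")
        inputs = [a.strip() for a in raw.split(",")] if raw.strip() else []
        steps.append(Step(m.group("name"), inputs, m.group("output").strip()))
    return steps


def audit_inputs_grounded(steps, given):
    """Illustrative audit: flag any step input that is neither given in the
    problem nor produced as the output of an earlier step."""
    known = set(given)
    violations = []
    for i, step in enumerate(steps):
        for arg in step.inputs:
            if arg not in known:
                violations.append((i, step.name, arg))
        known.add(step.output)
    return violations


steps = parse_trace(TRACE)
violations = audit_inputs_grounded(steps, given={"note"})
print(violations)
```

An audit of this kind inspects only the structure of the trace, not the correctness of each step's computation; a trace in which `compute_bmi` consumed a height that no earlier step produced would be flagged, while an arithmetic slip inside `compute_bmi` would not.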

Peer Reviews

Decision · Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 3

Strengths

- The problem is clearly motivated and well presented.
- I particularly liked the typicality audits based on reasoning patterns.
- The results on MedCalc are quite comprehensive.

Weaknesses

- My main concern: Why not generate certifiably correct reasoning? If the reasoning process follows a DSL, why not constrain generation to produce provably valid traces by construction (similar to Poesia et al.'s certified reasoning with LLMs)? The paper audits **after** generation, but doesn't explain why generation-time constraints aren't preferable. This would eliminate many errors rather than just detecting them. Certified Deductive Reasoning with Language Models (https://arxiv.org/abs/

Reviewer 02 · Rating 2 · Confidence 4

Strengths

The paper tackles a significant, well-motivated problem of how to test the validity of LLM reasoning beyond looking at their final answer. The general idea of proposing a loose structure makes sense, and it not being fully symbolic gives it flexibility to work across domains. While this has broadly been explored, the idea of allowing users to write programmatic audits is novel as far as I'm aware.

Weaknesses

The paper lacks important details about most of the method. Central to SSRMs is the format that is enforced, but the format itself is only vaguely described (there's the example in Figure 1, but the format is barely mentioned in Section 3). I'm confused by the fact that the representation is a Pandas DataFrame, since programs are hierarchical (and even if the trace is just linear, each function call has a variable number of arguments, which I would assume map to columns in the data frame). Thus,

Reviewer 03 · Rating 4 · Confidence 4

Strengths

- The paper tackles the highly significant and timely problem of ensuring the faithfulness of LLM reasoning. As LLMs are increasingly deployed in high-stakes domains, developing methods for auditing their reasoning processes is of critical importance, and this work makes a valuable contribution in this direction.
- The central idea of leveraging semi-structured representations for automated auditing is both novel and elegant. It offers a promising paradigm for moving beyond opaque, free-form tex

Weaknesses

- The paper's clarity could be significantly improved, particularly concerning the methodological details. As it stands, some aspects of the implementation are challenging to fully understand, which may hinder reproducibility.
- The motivation behind choosing a "Pythonic syntax" would benefit from a more thorough discussion. Providing a comparison with alternative formalisms and explaining the trade-offs would help justify this specific design choice.
- The paper builds heavily on Program Trace

Taxonomy

Topics: Law, Economics, and Judicial Systems · Artificial Intelligence in Law

Methods: ADaptive gradient method with the OPTimal convergence rate