Exploring the Meta-level Reasoning of Large Language Models via a Tool-based Multi-hop Tabular Question Answering Task

Nick Ferguson; Alan Bundy; Kwabena Nuamah

arXiv:2601.07696·cs.CL·January 13, 2026

Open Access · 3 Reviews

TL;DR

This paper introduces a structured multi-hop question answering task to evaluate the meta-level reasoning abilities of large language models, focusing on tool selection and intermediate reasoning steps, revealing strengths and limitations in their reasoning and numeracy skills.

Contribution

It presents a novel task for assessing meta-level reasoning in LLMs, emphasizing tool use and intermediate step analysis, advancing understanding beyond final answer accuracy.

Findings

01. LLMs show good meta-level reasoning on the task

02. N-shot prompting has little impact on accuracy

03. LLMs exhibit poor numeracy skills

Abstract

Recent advancements in Large Language Models (LLMs) are increasingly focused on "reasoning" ability, a concept with many overlapping definitions in the LLM discourse. We take a more structured approach, distinguishing meta-level reasoning (the process of reasoning about the intermediate steps required to solve a task) from object-level reasoning (the low-level execution of those steps). We design a novel question answering task based on the values of geopolitical indicators for various countries over various years. Questions must be broken down into intermediate steps, and require retrieval of data and mathematical operations over that data. The meta-level reasoning ability of LLMs is analysed by examining the selection of appropriate tools for answering questions. To bring greater depth to the analysis of LLMs beyond final answer accuracy, our task…
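As a rough illustration of the split the abstract describes, the sketch below treats the plan (which tools to call, in what order) as the meta-level part and the execution of each call as the object-level part. Everything in it is an assumption made for illustration: the tool names (`retrieve`, `mean`, `subtract`), the `$i` back-reference convention, and the indicator values are invented and are not taken from the paper.

```python
from statistics import mean

# Hypothetical toy data keyed by (country, indicator, year).
# All values are invented for illustration only.
DATA = {
    ("A", "population", 2000): 10.0,
    ("A", "population", 2001): 12.0,
    ("B", "population", 2000): 20.0,
    ("B", "population", 2001): 22.0,
}

# Hypothetical tool set: one retrieval tool and two arithmetic tools.
TOOLS = {
    "retrieve": lambda country, indicator, year: DATA[(country, indicator, year)],
    "mean": lambda *xs: mean(xs),
    "subtract": lambda a, b: a - b,
}

# Meta-level output: a plan for the question "How much larger was B's mean
# population than A's over 2000-2001?". "$i" refers to the result of step i.
PLAN = [
    ("retrieve", ("A", "population", 2000)),
    ("retrieve", ("A", "population", 2001)),
    ("mean", ("$0", "$1")),
    ("retrieve", ("B", "population", 2000)),
    ("retrieve", ("B", "population", 2001)),
    ("mean", ("$3", "$4")),
    ("subtract", ("$5", "$2")),
]

def execute(plan):
    """Object-level execution: run each tool call in order, resolving "$i" references."""
    results = []
    for tool, args in plan:
        resolved = [
            results[int(a[1:])] if isinstance(a, str) and a.startswith("$") else a
            for a in args
        ]
        results.append(TOOLS[tool](*resolved))
    return results

if __name__ == "__main__":
    print(execute(PLAN)[-1])
```

Under this framing, an evaluation of meta-level reasoning would inspect `PLAN` itself rather than only the final number that `execute` returns.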

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 6 · Confidence 4

Strengths

1. The meta-level reasoning capability in the paper is defined as the high-level planning capability. The question design is highly consistent with this design goal. The proposed question construction format can be easily adapted to domains other than geopolitical. 2. The dataset provides the correct sequence of essential actions, which enables assessment of partial correctness in LLM responses, thus offering a more nuanced evaluation of reasoning capabilities than binary success/failure metri

Weaknesses

1. The task combines high-level planning with tool use, where LLMs need to understand tool call rules from the prompt. LLM performance depends on both planning and tool-calling capabilities. Therefore, the evaluation may be influenced by confounding factors. 2. The presented metrics are precision and recall, computed from the content overlap between the LLM-generated actions and the ground-truth essential actions. In some cases, the action sequence is order-dependent, meaning different orderings of the sa
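The reviewer's second point can be made concrete with a small sketch. It assumes a set-based overlap between predicted and essential actions; the paper's exact metric definition is not given in this excerpt, so `precision_recall` and the action tuples below are hypothetical.

```python
def precision_recall(predicted, essential):
    """Set-based precision and recall over tool-call actions (an assumed definition)."""
    pred, gold = set(predicted), set(essential)
    overlap = len(pred & gold)
    precision = overlap / len(pred) if pred else 0.0
    recall = overlap / len(gold) if gold else 0.0
    return precision, recall

# Hypothetical ground-truth essential actions and one model's predicted actions.
gold = [("retrieve", "A", 2000), ("retrieve", "A", 2001), ("subtract",)]
pred = [("subtract",), ("retrieve", "A", 2000), ("retrieve", "B", 2000)]

if __name__ == "__main__":
    print(precision_recall(pred, gold))
    # Reordering the predicted actions leaves a set-based score unchanged.
    print(precision_recall(list(reversed(pred)), gold))
```

Because the comparison discards order, a prediction that subtracts before retrieving scores the same as one that retrieves first, which is the kind of case the reviewer flags.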

Reviewer 02 · Rating 2 · Confidence 3

Strengths

(1) The paper offers a deep exposition and evaluation of meta-level reasoning—specifically, a model’s abilities in task decomposition and task planning. (2) Abstract concepts are described clearly, and the overall writing is fluent and easy to follow.

Weaknesses

(1) A fundamental concern: in modern LLMs, it is difficult to cleanly separate “meta-level” and “object-level” reasoning in the symbolic-AI sense. For complex tasks in the real world, most steps blend both forms of reasoning. The paper’s “first decompose, then execute” setup seems best suited to relatively simple multi-hop tasks (e.g., retrieval-augmented or calculation-augmented), and may not generalize to richer scenarios. (2) The evaluation set is too narrow. Although the paper notes that the ap

Reviewer 03 · Rating 2 · Confidence 4

Strengths

1. Problem framing: Clear split between object-level and meta-level, which is useful for capability-wise diagnosis and future extensions. 2. Beyond accuracy, the paper reports precision/recall (±1σ) and an Err. rate (share of outputs with at least one tool-call error), capturing both process robustness and final correctness. 3. Comparisons across 0/1/3-shot, error vs. no-error conditions, and tool-set ablations (all tools vs. data-retrieval only) reveal more perspectives. 4. Clarity. Method
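A minimal sketch of how the summary statistics mentioned in point 2 might be computed, assuming per-question records of precision, recall, and a count of tool-call errors. The record layout, the invented values, and the choice of population standard deviation (`pstdev`) are assumptions; the excerpt does not say which variant the paper uses.

```python
from statistics import mean, pstdev

# Hypothetical per-question results; all values are invented for illustration.
RUNS = [
    {"precision": 1.0, "recall": 1.0, "tool_errors": 0},
    {"precision": 0.5, "recall": 1.0, "tool_errors": 2},
    {"precision": 0.75, "recall": 0.5, "tool_errors": 0},
    {"precision": 0.75, "recall": 0.5, "tool_errors": 1},
]

def summarise(runs):
    """Return (mean, std) for precision and recall, plus the error rate."""
    summary = {}
    for key in ("precision", "recall"):
        values = [r[key] for r in runs]
        summary[key] = (mean(values), pstdev(values))
    # Err. rate: share of outputs with at least one tool-call error.
    summary["err_rate"] = sum(1 for r in runs if r["tool_errors"] > 0) / len(runs)
    return summary

if __name__ == "__main__":
    print(summarise(RUNS))
```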

Weaknesses

1. The benchmark relies on templated prompts and relatively narrow tool APIs, making it too simple compared with current practice, e.g., richer coding/spreadsheet/enterprise interfaces such as in SheetBrain: A Neuro-Symbolic Agent for Accurate Reasoning over Complex and Large Spreadsheets (https://arxiv.org/abs/2510.19247). 2. The study omits stronger/newer models (e.g., GPT-5, Gemini 2.5 Pro) and larger mainstream open models (e.g., newer Qwen3 variants), making it hard to gauge the field’s cur

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Computational and Text Analysis Methods