Reasoning-Intensive Regression

Diane Tchuindjo; Omar Khattab

arXiv:2508.21762·cs.CL·May 4, 2026


TL;DR

This paper introduces a new benchmark for reasoning-intensive regression tasks using LLMs, highlighting the limitations of existing methods and proposing MENTAT, a lightweight ensemble approach that significantly improves performance.

Contribution

The paper establishes a new RiR benchmark, evaluates existing methods, and proposes MENTAT, a novel ensemble technique that outperforms prompting and fine-tuning baselines.

Findings

01. MENTAT achieves up to 65% improvement over baselines.

02. Prompting and fine-tuning often struggle in RiR tasks.

03. The benchmark reveals challenges in current LLM approaches for RiR.

Abstract

AI researchers and practitioners increasingly apply large language models (LLMs) to what we call reasoning-intensive regression (RiR), i.e., deducing subtle numerical scores from text. Unlike standard language regression tasks such as sentiment or similarity analysis, RiR often appears instead in ad-hoc applications such as rubric-based scoring, modeling dense rewards in complex environments, or domain-specific retrieval, where much deeper analysis of context is required while only limited task-specific training data and computation are available. We cast four realistic problems as RiR tasks to establish an initial benchmark, and use that to test our hypothesis that prompting frozen LLMs and fine-tuning Transformer encoders via gradient descent will both often struggle in RiR. We then propose MENTAT, a simple and lightweight method that combines batch-reflective prompt optimization with…

Tables

Table 1. Performance comparison across Mathematical Error Detection, Pairwise RAG Comparison, and Essay Grading using GPT-4.1 and GPT-5 as our models. Each entry is the average of three independent runs on a test set of size 750. Total training sizes are 100 and 500 (train/val combined). Ablations: MENTAT Prompt uses only error-driven prompt refinement on training data starting from a basic prompt; MENTAT-Avg shows performance when replacing the trained MLP with averaging. We remark here that NeoBERT obtains an average NMSE and CCC of 0.60 and 0.66, respectively, on a training regime of 1500 examples (1000 training + 500 validation) on Pairwise RAG Comparison. That is, NeoBERT can reach good performance on this task, but it needs much more data to do so. This table, along with additional reporting of standard deviation, can be found in Table 3 in the appendices.

LM	Method	Math Errors				Pairwise RAG				Essay Grading
		NMSE $↓$		CCC $↑$		NMSE $↓$		CCC $↑$		NMSE $↓$		CCC $↑$
		100	500	100	500	100	500	100	500	100	500	100	500
Main Methods
NeoBERT	Gradient Descent	1.05	1.01	0.02	0.06	1.44	1.02	0.02	0.10	1.03	0.91	0.19	0.65
GPT-4.1	Basic Prompt	1.59	1.59	0.36	0.36	2.18	2.18	0.47	0.47	0.75	0.75	0.63	0.63
	Detailed Prompt	1.13	1.13	0.52	0.52	2.20	2.20	0.47	0.47	0.73	0.73	0.65	0.65
	MENTAT $_{Basic Prompt}$	0.87	0.76	0.51	0.49	0.77	0.80	0.50	0.52	0.54	0.53	0.70	0.68
GPT-5	Basic Prompt	0.77	0.77	0.66	0.66	2.25	2.25	0.35	0.35	1.31	1.31	0.42	0.42
	Detailed Prompt	0.78	0.78	0.69	0.69	2.18	2.18	0.31	0.31	1.53	1.53	0.40	0.40
	MENTAT $_{Basic Prompt}$	0.52	0.42	0.72	0.78	1.07	0.93	0.36	0.33	0.64	0.67	0.59	0.55
Ablations
GPT-4.1	MENTAT Prompt	1.39	1.29	0.45	0.48	2.00	1.69	0.45	0.48	0.61	0.71	0.68	0.66
	MENTAT-Avg	1.00	1.01	0.52	0.52	1.82	1.48	0.48	0.51	0.57	0.63	0.69	0.68
	GEPA	1.04	1.01	0.49	0.54	2.16	2.40	0.44	0.43	0.79	0.81	0.63	0.63
GPT-5	MENTAT Prompt	0.66	0.58	0.66	0.72	1.43	1.95	0.33	0.30	0.74	0.70	0.57	0.54
	MENTAT-Avg	0.59	0.51	0.68	0.75	1.31	1.83	0.35	0.32	0.69	0.67	0.57	0.55
	GEPA	0.78	0.63	0.68	0.69	2.48	2.29	0.28	0.28	1.01	1.01	0.42	0.44

Table 2. Performance on the Instruction Following task using the gpt-oss-20b model. Each entry is the average of three independent runs on a test set of size 2000. The training configuration uses 500 training and 500 validation samples. Ablations: MENTAT Prompt uses only error-driven prompt refinement on training data; MENTAT-Avg shows performance when replacing the trained MLP with averaging. The subscripts, Basic Prompt and Detailed Prompt, indicate which prompt we use as the initial prompt in the MENTAT framework. Values in parentheses are standard deviations.

LM	Method	Instruction Following
		NMSE $↓$	CCC $↑$
NeoBERT	Gradient Descent	1.08 (0.07)	0.36 (0.04)
GPT-OSS-20B	RL Fine-Tuning	1.51 (0.03)	0.37 (0.02)
GPT-OSS-20B	Basic Prompt	1.18 (0.00)	0.32 (0.00)
	Detailed Prompt	1.16 (0.00)	0.33 (0.00)
	MENTAT $_{Basic Prompt}$	0.95 (0.09)	0.42 (0.01)
	MENTAT $_{Detailed Prompt}$	0.90 (0.04)	0.43 (0.00)
Ablations
GPT-OSS-20B	MENTAT $_{Basic Prompt}$ Prompt	1.25 (0.05)	0.35 (0.01)
	MENTAT $_{Basic Prompt}$ -Avg	1.06 (0.04)	0.38 (0.02)
	MENTAT $_{Detailed Prompt}$ Prompt	1.24 (0.13)	0.36 (0.01)
	MENTAT $_{Detailed Prompt}$ -Avg	1.09 (0.06)	0.39 (0.02)
	GEPA	1.06 (0.02)	0.46 (0.01)

Table 3. Representation of Table 1 with additional reporting of standard deviation.

LM	Method	Math Errors				Pairwise RAG				Essay Grading
		NMSE $↓$		CCC $↑$		NMSE $↓$		CCC $↑$		NMSE $↓$		CCC $↑$
		100	500	100	500	100	500	100	500	100	500	100	500
Main Methods
NeoBERT	Gradient Descent	1.05 (0.03)	1.01 (0.02)	0.02 (0.01)	0.06 (0.04)	1.44 (0.63)	1.02 (0.02)	0.02 (0.01)	0.10 (0.01)	1.03 (0.17)	0.91 (0.38)	0.19 (0.09)	0.65 (0.09)
GPT-4.1	Basic Prompt	1.59 (0.03)	1.59 (0.03)	0.36 (0.02)	0.36 (0.02)	2.18 (0.01)	2.18 (0.01)	0.47 (0.00)	0.47 (0.00)	0.75 (0.00)	0.75 (0.00)	0.63 (0.00)	0.63 (0.00)
	Detailed Prompt	1.13 (0.01)	1.13 (0.01)	0.52 (0.00)	0.52 (0.00)	2.20 (0.04)	2.20 (0.04)	0.47 (0.01)	0.47 (0.01)	0.73 (0.01)	0.73 (0.01)	0.65 (0.00)	0.65 (0.00)
	MENTAT $_{Basic Prompt}$	0.87 (0.03)	0.76 (0.01)	0.51 (0.01)	0.49 (0.01)	0.77 (0.06)	0.80 (0.04)	0.50 (0.02)	0.52 (0.03)	0.54 (0.01)	0.53 (0.04)	0.70 (0.00)	0.68 (0.01)
GPT-5	Basic Prompt	0.77 (0.00)	0.77 (0.00)	0.66 (0.00)	0.66 (0.00)	2.25 (0.04)	2.25 (0.04)	0.35 (0.01)	0.35 (0.01)	1.31 (0.00)	1.31 (0.00)	0.42 (0.00)	0.42 (0.00)
	Detailed Prompt	0.78 (0.05)	0.78 (0.05)	0.69 (0.01)	0.69 (0.01)	2.18 (0.03)	2.18 (0.03)	0.31 (0.01)	0.31 (0.01)	1.53 (0.01)	1.53 (0.01)	0.40 (0.00)	0.40 (0.00)
	MENTAT $_{Basic Prompt}$	0.52 (0.00)	0.42 (0.02)	0.72 (0.00)	0.78 (0.02)	1.07 (0.02)	0.93 (0.07)	0.36 (0.06)	0.33 (0.07)	0.64 (0.06)	0.67 (0.04)	0.59 (0.03)	0.55 (0.04)
Ablations
GPT-4.1	MENTAT Prompt	1.39 (0.00)	1.29 (0.00)	0.45 (0.00)	0.48 (0.00)	2.00 (0.16)	1.69 (0.21)	0.45 (0.02)	0.48 (0.02)	0.61 (0.04)	0.71 (0.08)	0.68 (0.00)	0.66 (0.01)
	MENTAT-Avg	1.00 (0.00)	1.01 (0.00)	0.52 (0.00)	0.52 (0.00)	1.82 (0.17)	1.48 (0.20)	0.48 (0.02)	0.51 (0.03)	0.57 (0.03)	0.63 (0.06)	0.69 (0.00)	0.68 (0.00)
	GEPA	1.04 (0.09)	1.01 (0.03)	0.49 (0.03)	0.54 (0.01)	2.16 (0.15)	2.40 (0.05)	0.44 (0.01)	0.43 (0.02)	0.79 (0.07)	0.81 (0.03)	0.63 (0.03)	0.63 (0.01)
GPT-5	MENTAT Prompt	0.66 (0.03)	0.58 (0.01)	0.66 (0.09)	0.72 (0.01)	1.43 (0.08)	1.95 (0.49)	0.33 (0.05)	0.30 (0.06)	0.74 (0.07)	0.70 (0.07)	0.57 (0.04)	0.54 (0.05)
	MENTAT-Avg	0.59 (0.05)	0.51 (0.03)	0.68 (0.09)	0.75 (0.00)	1.31 (0.03)	1.83 (0.43)	0.35 (0.06)	0.32 (0.07)	0.69 (0.06)	0.67 (0.07)	0.57 (0.03)	0.55 (0.05)
	GEPA	0.78 (0.03)	0.63 (0.08)	0.68 (0.02)	0.69 (0.00)	2.48 (0.00)	2.29 (0.03)	0.28 (0.00)	0.28 (0.02)	1.01 (0.11)	1.01 (0.08)	0.42 (0.02)	0.44 (0.01)


Full text

Reasoning-Intensive Regression

Diane Tchuindjo

Massachusetts Institute of Technology, Cambridge, MA, USA


and

Omar Khattab

Massachusetts Institute of Technology, Cambridge, MA, USA


(2026)

Abstract.

AI researchers and practitioners increasingly apply large language models (LLMs) to what we call reasoning-intensive regression (RiR), i.e., deducing subtle numerical scores from text. Unlike standard language regression tasks such as sentiment or similarity analysis, RiR often appears instead in ad-hoc applications such as rubric-based scoring, modeling dense rewards in complex environments, or domain-specific retrieval, where much deeper analysis of context is required while only limited task-specific training data and computation are available. We cast four realistic problems as RiR tasks to establish an initial benchmark, and use that to test our hypothesis that prompting frozen LLMs and fine-tuning Transformer encoders via gradient descent will both often struggle in RiR. We then propose MENTAT, a simple and lightweight method that combines batch-reflective prompt optimization with neural ensemble learning. MENTAT achieves up to $65\%$ improvement over both baselines, though substantial room remains for future advances. (Footnote 1: Data: https://huggingface.co/datasets/dianetc/rir-paper-data)

Conference: ACM Conference on AI and Agentic Systems (CAIS ’26), May 26–29, 2026, San Jose, CA, USA. Journal year: 2026. DOI: 10.1145/3786335.3813139. ISBN: 979-8-4007-2415-2/2026/05.

1. Introduction

Despite fast progress in adapting large language models (LLMs) for downstream problems, lightweight methods for teaching LLMs to do even standard natural-language regression tasks remain surprisingly elusive (Lukasik et al., 2024, 2025; Tang et al., 2024; Song et al., 2024; Song and Bahri, 2025). These tasks, like sentiment analysis, semantic similarity, and document ranking, involve predicting a score $y\in\mathbb{R}$ from a natural-language string. On these problems, applying straightforward supervised learning to pretrained Transformer encoders such as BERT (Devlin et al., 2019) has been shown to perform competitively with much larger decoder-only LLMs (Lukasik et al., 2025), even with sophisticated fine-tuning methods.

We investigate what we call Reasoning-Intensive Regression (RiR), a fuzzy but growing subset of natural-language regression in which processing the text in each instance demands sequential deduction or deep analysis, rather than shallow identification of features. Unlike simpler regression tasks, RiR problems call for explicit step-by-step problem decomposition or reasoning, where the system produces intermediate sequences of steps like tokens $\langle r_{1},...,r_{t}\rangle\in\Sigma^{*}$ before committing to a prediction (Merrill and Sabharwal, 2024). See Figure 2 for a breakdown of regression problems into three levels of complexity: feature-based, semantic analysis, and reasoning-intensive, inspired by Su et al. (2025)’s analysis of retrieval tasks.

These types of applications are emerging rapidly in both research and practice, e.g., to produce scores for ad-hoc applications that process customer calls, student essays, rubric-based LLM generation, or instruction-based query–document relevance (MacDonald, 2024; Es et al., 2024; Su et al., 2025; Thakur et al., 2025). In parallel, the same scoring paradigm is being scaled in recent efforts toward general-purpose chain-of-thought reward models (Kimi Team, 2025; Ankner et al., 2024), but these typically assume orders-of-magnitude more labels and compute (e.g., hundreds of thousands of labels in K2) than the lightweight application-specific regimes that are far more common in the long tail.

We establish an initial benchmark for RiR by casting four realistic tasks as regression problems that demand varying levels of reasoning: predicting the proportion of a long mathematical deduction up to the first erroneous statement, determining the extent to which an LLM can follow highly composite instructions, predicting the degree to which the response of one Retrieval-Augmented Generation (RAG) system is better than another, and grading student essays on supplied topics. We then identify two practical constraints of downstream RiR applications. Such applications typically offer only very small training sets and are limited to lightweight computations (LLM inference, lightweight prompt optimization, and fine-tuning of medium-sized networks such as small Transformers), precluding approaches like large-scale reinforcement learning for LLMs (DeepSeek-AI, 2025; Kimi Team, 2025).

We ask: Are there effective methods that are data- and compute-efficient for tackling ad-hoc reasoning-intensive regression problems? We hypothesize that what makes RiR problems especially challenging is that they combine the reasoning need for deep analysis of each individual task instance with the regression challenge of learning to produce precise, calibrated, and well-ranked scores from very little data. As illustrated in Figure 1, standard prompt engineering techniques struggle with the high precision needed for learning to approximate a statistical distribution, while approaches that bypass LLM-based reasoning, e.g., training small Transformer encoders, often fail to truly learn RiR problems and instead seek to “hack” the regression loss function by finding degenerate approximations (e.g., collapsing to a small range of scores).

We propose Mistake-Aware prompt Evolver with Neural Training And Testing (MENTAT), a simple and lightweight method that combines iterative prompt optimization with neural regression. Rather than relying on LLMs to produce precise numerical predictions directly, MENTAT uses an iterative error-driven prompt evolution process. Starting with even just a very basic prompt, the LLM analyzes its own prediction errors in large batches, identifies patterns in its poor performance, and then refines the prompt accordingly. After a few iterations, MENTAT trains a simple aggregation MLP to reduce multiple rollouts from the LLM-discovered prompt into a final prediction. MENTAT delivers consistent improvements in quality, but nonetheless leaves large headroom on many of the RiR settings we define. Our contributions are threefold:

(1) Problem Formulation. We formalize Reasoning-Intensive Regression (RiR) as a distinct subclass of natural-language regression, distinguishing it from feature-based and semantic-analysis tasks by its requirement for explicit multi-step reasoning before score prediction.

(2) Benchmark. We establish an initial RiR benchmark comprising four tasks of varying reasoning intensity: mathematical error detection, instruction following, pairwise RAG comparison, and essay grading. We advocate for Concordance Correlation Coefficient (CCC) as a more appropriate metric than NMSE for RiR evaluation, as it captures both ranking quality and calibration.

(3) Method and Analysis. We propose MENTAT, a lightweight method combining batch-reflective prompt optimization with neural ensemble learning. Through systematic evaluation, we demonstrate that neither prompting frozen LLMs nor fine-tuning Transformer encoders alone suffices for RiR, while MENTAT achieves up to 65% improvement over baselines, though substantial headroom remains.

The remainder of the paper is organized as follows: Section 2 describes how we translate four problems into RiR tasks and Section 3 introduces MENTAT. Section 4 presents our evaluation methodology, including the details of our baselines, and the results. The paper concludes with Section 5, which synthesizes our findings and discusses implications for future research. An extended discussion of related work is given in Appendix A.

2. Benchmarking RiR

We collect four tasks for Reasoning-Intensive Regression, chosen to span a range of reasoning intensities (from pairwise preference judgment to mathematical error localization) and label distributions (bimodal, narrow, tight clustering). Refer to Figure 3 for dataset distributions. We view these as representative rather than exhaustive.

• Mathematical Error Detection requires precise logical reasoning and stepwise analysis, while also stressing the fact that LLMs are known to struggle with precisely estimating simple properties like text length.

• Instruction Following evaluates how well a response satisfies a set of fine-grained requirements, and expects models to produce calibrated scalar judgments.

• Pairwise RAG Comparison asks models to perform nuanced judgment and contextual understanding.

• Essay Grading serves as a reference point, requiring semantic understanding where encoders like BERT might already perform well with a reasonable amount of fine-tuning data.

These tasks can be framed as instances of LLM-as-a-Judge evaluation, an area where practitioners have identified binary or coarse scoring as a critical bottleneck at scale, finding that it “collapses under real-world complexity” by hiding the distinction between nearly correct and entirely wrong responses (Sinha et al., 2025). RiR formalizes this observation: producing calibrated continuous judgments requires reasoning that goes beyond what coarse classification demands. Below we identify the appropriate metric for evaluating RiR methods and further describe the tasks above.

Regression Metrics

Normalized Mean Square Error (NMSE) is a common metric for reporting regression performance: $\mathrm{NMSE}=\sum_{i=1}^{n}(y_{i}-\hat{y}_{i})^{2}/\sum_{i=1}^{n}(y_{i}-\bar{y})^{2}$, where $n$ is the size of the dataset, $\hat{y}_{i}$ is a prediction, $y_{i}$ is the corresponding ground-truth value, and $\bar{y}$ is the mean of the ground-truth values.

But distance-based metrics are inadequate for typical RiR problems; RiR systems can artificially lower their NMSE simply by avoiding “risky” predictions at the extremes. This can be seen in Figure 1 earlier, particularly in comparing the fine-tuned NeoBERT model (Breton et al., 2025) against detailed (human-crafted) prompting. Following Figure 1, if we were to rely on NMSE, detailed prompting for GPT-5 would not appear to substantially outperform NeoBERT ( $0.81$ vs. $1.01$ ), and this gap would even be reversed for weaker LLMs. Examining the distribution of predictions reveals that NeoBERT “hacked” the loss function by learning a collapsed distribution, while the prompted LLM actually shows substantial signs of ranking the inputs correctly.

This can be captured in a Concordance Correlation Coefficient (CCC) of $0.01$ for NeoBERT versus a CCC of $0.69$ for detailed prompting. We thus suggest the use of the CCC as an additional, and perhaps more appropriate, RiR metric. CCC measures both correlation and agreement, defined as $\frac{2\rho\sigma_{y}\sigma_{\hat{y}}}{\sigma_{y}^{2}+\sigma_{\hat{y}}^{2}+(\mu_{y}-\mu_{\hat{y}})^{2}}$ , where $\rho$ is the Pearson correlation coefficient between predictions $\hat{y}$ and ground truth $y$ , $\sigma_{y}$ and $\sigma_{\hat{y}}$ are their respective standard deviations, and $\mu_{y}$ and $\mu_{\hat{y}}$ are their means. CCC penalizes systematic bias and rewards predictions that maintain the natural variance of the distribution.
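To make the contrast between the two metrics concrete, the following is a minimal sketch of NMSE and CCC as defined above, using only the Python standard library. The toy labels and predictions are made up for illustration and are not taken from the paper; the use of population (biased) moments for the variances is an assumption, since the text does not specify the estimator.

```python
from statistics import fmean


def nmse(y_true, y_pred):
    """Squared error divided by the squared deviation of labels from their mean."""
    mean_y = fmean(y_true)
    num = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    den = sum((y - mean_y) ** 2 for y in y_true)
    return num / den


def ccc(y_true, y_pred):
    """Concordance Correlation Coefficient with population moments (assumption)."""
    n = len(y_true)
    mu_y, mu_p = fmean(y_true), fmean(y_pred)
    var_y = sum((y - mu_y) ** 2 for y in y_true) / n
    var_p = sum((p - mu_p) ** 2 for p in y_pred) / n
    cov = sum((y - mu_y) * (p - mu_p) for y, p in zip(y_true, y_pred)) / n
    # rho * sigma_y * sigma_p equals the covariance, so the numerator is 2 * cov.
    return 2 * cov / (var_y + var_p + (mu_y - mu_p) ** 2)


# Toy data (illustrative only).
labels = [0.0, 2.0, 4.0, 6.0, 8.0, 10.0]
collapsed = [5.0] * len(labels)                  # always predicts the label mean
overshoot = [-6.0, -2.0, 2.0, 8.0, 12.0, 16.0]   # right ranking, exaggerated spread

print("collapsed:", nmse(labels, collapsed), ccc(labels, collapsed))
print("overshoot:", nmse(labels, overshoot), ccc(labels, overshoot))
```

In this toy case the collapsed predictor has the lower NMSE even though it carries no ranking information, while the overshooting predictor has the higher CCC, which mirrors the NeoBERT versus prompting comparison described above.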

Detecting Mathematical Errors

We derive a dataset for predicting the fraction of a mathematical solution up to the first erroneous reasoning step, given a problem and incorrect solution in LaTeX, from ProcessBench (Zheng et al., 2025). To effectively do this, a model must systematically reason formally about math steps rather than relying on probabilistic heuristics, but it must also be good at estimating relative lengths and inferring the boundaries of the steps in a calibrated way.

To convert the original classification task into a regression problem, we first filter out problems with correct solutions or final answers. We then merge all solution steps into a single continuous text $T=s_{1}\|s_{2}\|\cdots\|s_{n}$ (here $\|$ denotes concatenation). Next, for a solution with error at step $k$ , the regression score $R$ is $10\times(\sum\limits_{i=1}^{k-1}|s_{i}|+\frac{1}{2}|s_{k}|)/|T|$ where $|s_{i}|$ denotes the length of step $i$ , and $|T|$ is the total length of the concatenated solution. See an example entry in Appendix G.
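The score construction above can be sketched in a few lines. This is an illustration under stated assumptions: step length $|s_{i}|$ is measured in characters and concatenation adds no separators, neither of which the text specifies; the function name `error_position_score` is hypothetical.

```python
def error_position_score(steps, k):
    """R = 10 * (sum of |s_1..s_{k-1}| + 0.5 * |s_k|) / |T|, with k 1-indexed."""
    if not 1 <= k <= len(steps):
        raise ValueError("k must be the 1-indexed position of an existing step")
    total = sum(len(s) for s in steps)             # |T|, assuming no separators
    before = sum(len(s) for s in steps[:k - 1])    # length of the error-free prefix
    return 10 * (before + 0.5 * len(steps[k - 1])) / total


# Three placeholder steps of 40, 20, and 40 characters; the error is in step 2.
steps = ["a" * 40, "b" * 20, "c" * 40]
print(error_position_score(steps, 2))
```

Placing the target at the midpoint of the erroneous step means that an error in the first step still yields a nonzero score, and an error in the last step yields a score below 10.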

Instruction Following

We derive a task from the WildIFEval corpus (Lior et al., 2025) that targets instruction-following in long-form generation. Each example consists of: (i) a user task prompt; (ii) a list of atomic requirements (the decomposition); (iii) a model answer produced by Llama-3.1-8B (zero-shot); and (iv) per-requirement satisfaction scores originally produced by Llama-3.1-70B acting as an automatic judge. The goal is to predict a single continuous label $y\in[0,1]$ that reflects the overall degree to which the answer adheres to the decomposed instructions. More precisely, for each decomposition instance, the judge produced a probability-like score $s_{i}\in[0,1]$ for each requirement $r_{i}$ , $i=1,\dots,K$ . We then use the harmonic mean of these scores as our overall judgment, emphasizing the need to adhere well to all task requirements. To test instruction following, we do not expose the decomposition to NeoBERT or an LLM; instead, they are only given the task and model answer and must infer the overall score.
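As a small illustration of why the harmonic mean emphasizes adherence to all requirements, the sketch below compares it with the arithmetic mean on made-up per-requirement scores. Returning 0 when any score is 0 is an assumption about how a fully unmet requirement would be handled; the text does not say.

```python
def harmonic_mean(scores):
    """Harmonic mean of scores in [0, 1]; returns 0.0 if any score is 0 (assumption)."""
    if not scores:
        raise ValueError("at least one requirement score is needed")
    if any(s == 0 for s in scores):
        return 0.0
    return len(scores) / sum(1.0 / s for s in scores)


# Made-up judge scores: two requirements met well, one mostly missed.
scores = [0.9, 0.9, 0.1]
label = harmonic_mean(scores)
arithmetic = sum(scores) / len(scores)
print(label, arithmetic)
```

Here the single low score pulls the harmonic mean far below the arithmetic mean, so an answer cannot compensate for one missed requirement by satisfying the others.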

Pairwise RAG Comparison

We derive a dataset for comparing two LLM outputs on a scale from the RAG-QA evaluations (Han et al., 2024). Each query $q\in\mathcal{Q}$ has responses $A_{1},A_{2}$ and a target comparative score from $-2$ to $2$ representing the average annotation of three human judges, who were instructed to assess response helpfulness, truthfulness, and completeness. Here, positive scores mean that $A_{1}$ is better, and negative scores mean that $A_{2}$ is better. This task partially aligns with RiR, as judging the outputs and comparing them in light of each query often requires nuanced judgment.

Essay Grading

We lastly use an essay grading dataset (Crossley et al., 2023), where each entry contains among other features an essay prompt, a student (grade 8–12) response, associated demographic information, and an overall score between $1$ and $5$ . Although Essay Grading is simpler than the rest, it serves as a reference point for the other RiR tasks.

We evaluate these tasks using two proprietary LLMs (GPT-4.1, GPT-5) across three tasks. For Instruction Following, we utilize an open-source model (gpt-oss-20b) for reproducibility and generalization validation.

3. MENTAT

MENTAT combines two simple ideas, depicted in Figure 4: it allows the LLM itself to reflect in batches to incrementally adjust its own prompt, and it aggregates multiple rollouts from the optimized LLM system with a simple trained MLP.

3.1. Phase 1: Prompt Evolution

MENTAT’s first step is to make sure that the LLM prompt reflects both local instructions for reasoning about each input and global guidance about the distribution of ground-truth scores. Though any approach for prompt optimization can be used here, e.g., MIPRO (Opsahl-Ong et al., 2024) or GEPA (Agrawal et al., 2025), through preliminary experiments we identified two special properties in RiR tasks that call for different design choices.

First, performing rollouts with powerful reasoning models can be expensive and slow, when compared to standard LLMs, for which existing optimizers were built. To remain within the lightweight constraints of typical RiR tasks, a suitable prompt evolution stage would have to minimize both the number of rollouts performed with the LLM and the number of inherently sequential stages or iterations of optimization. Second, RiR tasks require attention to distributional properties, calibration, variance matching, and avoiding collapse to mean predictions, beyond per-example accuracy. This is because MENTAT’s aggregation design demonstrates that it can be easy to turn a well-calibrated system into one that has low pointwise error, but the reverse is not necessarily true.

This motivates us to test an exceedingly simple reasoning-based technique for optimizing LLM systems that contain a single prompt. (Footnote 2: We leave extending this method to multi-stage LLM programs and conducting an extensive comparison of different prompt optimization strategies to future work.) While batch-based prompt optimization has been extensively explored in prior work (Pryzant et al., 2023; Ye et al., 2024), we focus on combining it with neural aggregation specifically for regression tasks, using CCC alongside NMSE to guide prompt selection and aggregator training. This simple design is inspired by human prompt engineering practice (Husain and Shankar, 2024).

Concretely, we proceed in a very small number of sequential iterations (three in our experiments). In each iteration, the work is highly parallelizable: we evaluate the current prompt on a shuffled sample of the training set, and then concatenate all of the rollouts for analysis by the same LLM. It is then asked to identify systematic errors by analyzing the worst-performing examples and to generate improved instructions. In each iteration, the LLM receives three key inputs: current instructions, performance analysis with detailed error patterns, and a formatted history of previous optimization attempts. This historical context prevents the method from cycling through previously unsuccessful approaches and enables progressive refinement. At the end of this process, the best-performing prompt (via NMSE or CCC) is selected on a separate validation set.
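The iteration described above can be sketched as a short loop. This is a minimal illustration, not the authors' implementation: `predict` and `reflect` are hypothetical callables standing in for LLM calls, the toy stubs below replace them with arithmetic so the example runs offline, and the sample size and number of worst examples are arbitrary choices.

```python
import random
from statistics import fmean


def nmse(y_true, y_pred):
    mean_y = fmean(y_true)
    num = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    return num / sum((y - mean_y) ** 2 for y in y_true)


def evolve_prompt(prompt, train, val, predict, reflect,
                  iterations=3, sample_size=8, n_worst=3, seed=0):
    """Run a few reflect-and-rewrite rounds, then pick the best prompt on val."""
    rng = random.Random(seed)
    candidates, history = [prompt], []
    for _ in range(iterations):
        batch = rng.sample(train, min(sample_size, len(train)))  # shuffled sample
        # Each rollout is independent, so this step could be parallelized.
        scored = [(abs(predict(prompt, x) - y), x, y) for x, y in batch]
        worst = sorted(scored, key=lambda t: t[0], reverse=True)[:n_worst]
        report = {"mean_abs_error": fmean(e for e, _, _ in scored), "worst": worst}
        new_prompt = reflect(prompt, report, history)
        history.append((prompt, report["mean_abs_error"]))  # context for later rounds
        prompt = new_prompt
        candidates.append(prompt)
    labels = [y for _, y in val]
    # Selection here uses validation NMSE; CCC could be used instead.
    return min(candidates, key=lambda p: nmse(labels, [predict(p, x) for x, _ in val]))


# Toy stand-ins (assumptions): the "LLM" has a bias that shrinks with each hint.
def toy_predict(prompt, x):
    return x + 4.0 / (1 + prompt.count("hint"))


def toy_reflect(prompt, report, history):
    return prompt + " hint"


train = [(float(i), float(i)) for i in range(10)]
val = [(float(i), float(i)) for i in range(10, 16)]
best = evolve_prompt("Score the input.", train, val, toy_predict, toy_reflect)
print(best)
```

Keeping every candidate, including the starting prompt, means the selection step can fall back to an earlier prompt if a later rewrite makes validation performance worse.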

In our evaluation, to stress MENTAT, we start from a deliberately basic prompt for each task, to reflect a more challenging and informative setting. (Footnote 3: Examples of the basic vs. the detailed prompts used for the four tasks can be found in Appendices J and H, respectively. They differ in the inclusion of detailed procedural steps, calibration guidance, and/or domain-specific heuristics that human experts may decide to include.) Note also that this iterative prompt evolution follows a single optimization trajectory. In principle, MENTAT could employ multiple random restarts, which could be parallelized to explore diverse regions of the prompt space. However, we focus on single-trajectory optimization both for computational efficiency and algorithmic simplicity.

3.2. Phase 2: Multi-Rollout Generation with Neural Aggregation

Using the best LLM-discovered prompt from Phase 1, MENTAT generates multiple independent predictions for each example. The multi-rollout approach captures the inherent uncertainty in LLM predictions, as each rollout can reason independently, and provides richer signal for the subsequent neural aggregation phase. In practice, we set this to three rollouts per example.

We train a small Multi-Layer Perceptron (MLP) to combine rollout predictions. The aggregator ensures order invariance by sorting rollout predictions, incorporates statistical features (mean, standard deviation, min, max), and is optimized for a combination of the CCC and NMSE loss functions. Overall, this method builds on self-consistency (Wang et al., 2023) and best-of-N voting (Stiennon et al., 2020; Snell et al., 2024), but differs by training a lightweight aggregator that learns task-specific weighting of rollout statistics rather than using fixed aggregation rules.
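The input side of the aggregator can be sketched as follows. This shows only a plausible feature construction (sorted rollouts plus summary statistics) and not the MLP or its training; the function name `rollout_features`, the feature order, and the use of the population standard deviation are assumptions.

```python
from statistics import fmean, pstdev


def rollout_features(rollouts):
    """Order-invariant feature vector: sorted predictions, then mean, std, min, max."""
    ordered = sorted(rollouts)  # sorting removes any dependence on rollout order
    return ordered + [fmean(ordered), pstdev(ordered), ordered[0], ordered[-1]]


# Three rollouts for one example, in two different arrival orders.
features_a = rollout_features([3.0, 1.0, 2.0])
features_b = rollout_features([2.0, 3.0, 1.0])
print(features_a)
print(features_a == features_b)
```

With exactly three rollouts, the min and max features duplicate the first and last sorted values; they are kept here only to mirror the list of statistics in the text.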

4. Evaluation

In our main experiments, we define two standard baseline approaches for RiR problems, fine-tuning a small Transformer encoder and prompting an LLM, and use these two to understand the relative merits of our method MENTAT and to develop a series of ablation experiments. Additionally, we compare against GEPA (Agrawal et al., 2025), a recent reflective prompt optimization method, to situate MENTAT relative to modern prompt optimizers.

4.1. Baseline: Fine-tuning a Transformer Encoder

We formulate RiR as supervised regression using a 250M-parameter NeoBERT model. The architecture processes minimally formatted text sequences (e.g., combining problem statements with solutions for math errors, augmented with domain-specific prompts).

Inputs are tokenized using NeoBERT’s byte-level BPE tokenizer, truncated or padded to 1024 tokens, and passed through the pretrained encoder. The model extracts representations from the [CLS] token, applies dropout regularization ( $p=0.2$ ), and uses linear projection for scalar predictions. The optimization objective minimizes weighted NMSE and CCC using AdamW (Loshchilov and Hutter, 2019). This architecture requires only prompt templating beyond standard fine-tuning, with hyperparameters detailed in Appendix D.1.

4.2. Baseline: Prompting a Large Language Model

We employ Chain-of-Thought style prompting to encourage frozen LLMs to perform explicit reasoning through step-by-step token generation. Our evaluation uses two proprietary models with different reasoning capabilities: GPT-4.1 (non-reasoning) and GPT-5 (reasoning) across three tasks (Mathematical Error Detection, Pairwise RAG Comparison, and Essay Grading). For the Instruction Following task, we employ gpt-oss-20b, an open-source model, to demonstrate that our method generalizes beyond proprietary systems and to provide more easily reproducible baselines for the community. We note that evaluating MENTAT on even smaller open-source models remains an important direction for validating on-premise deployability. The detailed prompts for all tasks help guide the decomposition of complex inputs and the templates can be found in Appendix H.

This approach is motivated by several practical advantages. Frozen LLMs can act as a unified interface across various natural language tasks, with very little to no training data. This is especially valuable in RiR domains where annotated datasets are often scarce. Utilizing a shared, unified, and amortized infrastructure (i.e., LLM servers) enables us to deploy a single model across many tasks, significantly reducing the computational and financial overhead compared to training multiple specialized models.

4.3. Additional Comparisons

We supplement our main baselines with two additional comparisons that probe different axes of the RiR problem. A state-of-the-art reasoning reward model tests whether models explicitly trained for preference judgment transfer to our continuous regression setting, while reinforcement learning (RL) fine-tuning allows us to test whether policy gradient methods can effectively optimize for RiR objectives given comparable compute. Each comparison is restricted to a single task: the reasoning reward model to pairwise RAG comparison, where preference modeling is most natural, and RL fine-tuning to instruction following, the task employing an open-source model whose weights can be updated.

4.3.1. Reasoning Reward Model

Pairwise preference judgment, the core task underlying reward modeling for RLHF, is a canonical example of reasoning-intensive regression. Recent work on Reasoning Reward Models (Chen et al., 2025) demonstrates that accurate preference judgments require explicit multi-step reasoning: inferring latent evaluation criteria, weighing trade-offs across dimensions (helpfulness, truthfulness, completeness), and grounding judgments in response content rather than surface features. Critically, their findings validate our RiR hypothesis: the bottleneck for preference modeling is reasoning capacity, not model scale, as reasoning-enhanced models outperform 70B+ parameter baselines despite being 5 $\times$ smaller.

This connection motivates two additions to our evaluation. First, we include a recent reasoning reward model as an additional baseline for the pairwise RAG comparison task, RM-R1-Qwen-14B. Second, since such models produce discrete preference labels rather than continuous scores, we extract regression targets via differences of log-probability $\hat{y}=4\cdot\sigma\bigl(\log p(A)-\log p(B)\bigr)-2$ , where $\sigma$ denotes the sigmoid function and $A$ , $B$ are preference tokens. This maps the model’s implicit preference strength to our target interval $[-2,2]$ , preserving the continuous nature of judgment confidence rather than discretizing to binary labels.
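The mapping from preference log-probabilities to a continuous score can be written directly from the formula above. This is a minimal sketch: the function name is hypothetical, and how the two log-probabilities are obtained from the reward model is outside its scope.

```python
import math


def preference_to_score(logp_a, logp_b):
    """Compute 4 * sigmoid(logp_a - logp_b) - 2, which lies in [-2, 2]."""
    diff = logp_a - logp_b
    # Numerically stable sigmoid: never exponentiates a large positive number.
    if diff >= 0:
        sig = 1.0 / (1.0 + math.exp(-diff))
    else:
        e = math.exp(diff)
        sig = e / (1.0 + e)
    return 4.0 * sig - 2.0


# Illustrative inputs: the model puts probability 0.9 on "A" and 0.1 on "B".
print(preference_to_score(math.log(0.9), math.log(0.1)))
# Equal log-probabilities map to the neutral score.
print(preference_to_score(-0.7, -0.7))
```

When the two token probabilities sum to one, the sigmoid of the log-probability difference equals the probability of "A", so the score is a linear rescaling of that probability onto the target interval.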

4.3.2. RL Fine-Tuning

We additionally evaluate RL-based fine-tuning using Group Relative Policy Optimization (Shao et al., 2024) on the instruction following task. We restrict this comparison to instruction following because it is the only task that employs an open-source model (gpt-oss-20b), whose weights can be updated; the remaining three tasks use proprietary models (GPT-4.1, GPT-5) that are accessible only through frozen inference APIs. We fine-tune a LoRA adapter on gpt-oss-20b via the Tinker API, sampling $k$ rollouts per problem at each training step and computing per-rollout rewards as $r=1-|y_{\text{pred}}-y_{\text{true}}|$ , where predictions are parsed from the model’s chain-of-thought output. Advantages are centered and normalized within each group of rollouts for the same problem, following the standard GRPO formulation, and applied to a PPO-clipped surrogate objective. We also experimented with other training objectives, but they yielded little benefit (see Appendix E).
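The reward and group-relative advantage computation described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the training code used in the experiments: the use of the population standard deviation, the stabilizer `eps`, and the clip range of `0.2` are illustrative choices that the text does not specify.

```python
from statistics import mean, pstdev


def rollout_reward(y_pred: float, y_true: float) -> float:
    """Per-rollout reward r = 1 - |y_pred - y_true| from the text."""
    return 1.0 - abs(y_pred - y_true)


def group_advantages(rewards, eps=1e-8):
    """Center and normalize rewards within one problem's group of rollouts.

    Assumption: population std and a small eps to avoid division by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_surrogate(ratio: float, advantage: float, clip: float = 0.2) -> float:
    """PPO-style clipped objective for one sample (clip=0.2 is an assumption)."""
    clipped_ratio = max(1.0 - clip, min(1.0 + clip, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


# One problem with true score 0.5 and k = 4 parsed predictions.
rewards = [rollout_reward(p, 0.5) for p in (0.2, 0.4, 0.6, 0.8)]
advantages = group_advantages(rewards)
```

Note that when every rollout in a group receives the same reward, all advantages are zero and that group contributes no gradient signal.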

4.4. Experimental Setup

Our experimental design evaluates MENTAT across four reasoning-intensive regression tasks using a structured approach. We test GPT-4.1 and GPT-5 on three tasks (Mathematical Error Detection, Pairwise RAG Comparison, and Essay Grading), while Instruction Following uses gpt-oss-20b to demonstrate generalization to open-source models.

For the three tasks using proprietary models, we employ $750$ test examples with results averaged across three independent runs. We evaluate under two training configurations ( $100$ and $500$ samples) that reflect real-world data constraints typical in ad-hoc RiR applications. For prompt optimization methods (including MENTAT), we use balanced train/validation splits of $50+50$ and $250+250$ samples; Phase 1 uses these for prompt evolution, and Phase 2 generates 3 rollouts per training sample for MLP training. For NeoBERT fine-tuning, we employ training-heavy splits of $50+50$ and $350+150$ samples to leverage the model’s supervised learning capabilities.

For Instruction Following, we use a single configuration with $500$ training, $500$ validation, and $2000$ test samples, reflecting the different data availability typical for this fundamental capability assessment. This larger test set enables more robust evaluation of the nuanced instruction-adherence requirements.

This experimental structure allows us to assess MENTAT’s effectiveness across different model capabilities (reasoning vs. non-reasoning), data regimes (limited vs. moderate training data), and model accessibility (proprietary vs. open-source), providing comprehensive validation of our approach for practical RiR applications.

4.4.1. Computation Cost

MENTAT’s computational costs comprise two phases. At inference time, each prediction requires 3 rollouts, resulting in $3\times$ token cost compared to single-pass prompting. However, all rollouts can be generated in parallel, so wall-clock latency remains approximately equivalent to a single rollout in parallelized deployment scenarios.

During optimization (Phase 1), MENTAT uses a fixed $3$ -iteration design. Each iteration evaluates the current prompt on all $n=250$ training and $n=250$ validation samples (parallelizable) and performs one reflection call analyzing the $\sqrt{250}\approx 16$ worst-performing examples. Including the initial baseline evaluation, this totals approximately $2{,}003$ ( $2\times 4\times 250+3$ ) LLM calls across $4$ sequential stages. GEPA’s “light” configuration converges after an average of $23$ sequential iterations (ranging $15$ – $34$ across runs). GEPA’s evolutionary search thus requires approximately $8\times$ more sequential rounds than MENTAT’s fixed design, providing MENTAT a substantial wall-clock advantage in parallelized deployments, though total token consumption may differ.

Phase 2 generates $3$ stochastic rollouts per training and validation example, adding $3\times 2n=1{,}500$ LLM calls, though predictions from Phase 1’s best iteration can serve as one rollout, reducing this to $2\times 2n=1{,}000$ . The MLP aggregator itself has negligible cost, containing only $8$ hidden units and training on $750$ rollout vectors ( $250$ samples $\times 3$ rollouts each).

Instruction Following Parameters

For the instruction following task, we select GRPO hyperparameters to match MENTAT’s optimization budget. MENTAT’s Phase 1 (prompt evolution) requires $4{,}003$ LLM calls across four sequential stages: one baseline evaluation and three refinement iterations, each evaluating $500$ training and $500$ validation examples plus one reflection call. Phase 2 adds $3$ stochastic rollouts per training and validation example, contributing an additional $2{,}000$ – $3{,}000$ calls depending on whether Phase 1 predictions are reused.

This yields an optimization budget of approximately $6{,}000$ – $7{,}000$ LLM calls (excluding test evaluation). To match this budget, we set batch size $b=12$ , group size $k=8$ , and train for $50$ steps, giving $12\times 8\times 50=4{,}800$ training rollouts. We evaluate on the full $500$ -example validation set at steps $20$ , $40$ , and $49$ (final), adding $3\times 500=1{,}500$ validation rollouts, for a total optimization budget of $6{,}300$ LLM calls, closely matching MENTAT’s compute envelope while allowing a direct methodological comparison.
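The call counts quoted in this section can be reproduced with a few lines of arithmetic. The sketch below only restates the accounting given in the text; the function names are hypothetical.

```python
def mentat_phase1_calls(n_train: int, n_val: int, iterations: int = 3) -> int:
    """Baseline evaluation plus one evaluation per refinement iteration,
    each over train + validation, plus one reflection call per iteration."""
    stages = iterations + 1
    return stages * (n_train + n_val) + iterations


def mentat_phase2_calls(n_train: int, n_val: int, rollouts: int = 3,
                        reuse_phase1: bool = False) -> int:
    """Stochastic rollouts per train/validation example; reusing the best
    Phase 1 predictions saves one rollout per example."""
    per_example = rollouts - 1 if reuse_phase1 else rollouts
    return per_example * (n_train + n_val)


def grpo_calls(batch_size: int, group_size: int, steps: int,
               n_val: int, n_evals: int) -> int:
    """Training rollouts plus full-validation-set evaluations."""
    return batch_size * group_size * steps + n_evals * n_val


# Proprietary-model tasks (250 + 250 split).
small_phase1 = mentat_phase1_calls(250, 250)
# Instruction following (500 + 500 split).
large_phase1 = mentat_phase1_calls(500, 500)
mentat_low = large_phase1 + mentat_phase2_calls(500, 500, reuse_phase1=True)
mentat_high = large_phase1 + mentat_phase2_calls(500, 500)
grpo_total = grpo_calls(batch_size=12, group_size=8, steps=50, n_val=500, n_evals=3)
```

Under this accounting, the GRPO budget falls inside the range spanned by MENTAT's budget with and without reuse of Phase 1 predictions.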

4.5. Results

Our main evaluation results are reported in Tables 1 and 2, demonstrating significant performance variations across methods and tasks. The results reveal distinct patterns in how different approaches handle reasoning-intensive regression problems, with MENTAT consistently outperforming baseline methods across most configurations. Beyond aggregate metrics, we analyze failure modes across methods: NeoBERT’s distribution collapse (Figure 1), GPT-5’s center-seeking behavior on pairwise RAG (Figure 5), and systematic quantization patterns in LLM outputs (Appendix B). We provide additional per-task qualitative error analysis in Appendix C.

Mathematical Error Detection Performance

On this task, fine-tuning NeoBERT achieves near-zero CCC scores across both training configurations, effectively collapsing to mean predictions as shown in Figure 1. In contrast, LLM-based approaches demonstrated substantial reasoning capabilities. GPT-4.1 with detailed prompting achieved CCC scores of $0.52$ (100-sample training) and maintained this performance at 500 samples. However, MENTAT with GPT-4.1 yielded no meaningful gain in CCC, reaching $0.51$ (100 samples) and $0.49$ (500 samples), i.e., approximately stable performance with slight variation. We hypothesize that GPT-4.1’s limited reasoning capabilities on this reasoning-intensive task made it difficult to understand its own errors and thus improve.

The most dramatic improvements can be seen with GPT-5. While detailed prompting with GPT-5 achieved strong baseline performance (CCC: 0.69, NMSE: 0.78), MENTAT with GPT-5 delivered substantial enhancements. In the 100-sample training regime, CCC improved by $4.3\%$ , while NMSE improved by $33.3\%$ . In the 500-sample training regime, CCC improved by $13\%$ , while NMSE improved by $46.2\%$ . These results indicate that MENTAT’s iterative prompt refinement and neural aggregation effectively leverage GPT-5’s reasoning capabilities while addressing the precision limitations inherent in direct LLM numerical prediction.

Instruction Following Performance

For the instruction following RiR task, the NeoBERT model achieved modest performance (CCC: $0.36$ , NMSE: $1.08$ ), while the gpt-oss-20b model with basic and detailed prompting showed similar limitations (CCC: $0.32$ – $0.33$ , NMSE: $1.16$ – $1.18$ ). RL fine-tuning via GRPO improved CCC to $0.37$ over frozen prompting, though at the cost of degraded calibration (NMSE: $1.51$ ). The elevated NMSE despite improved CCC suggests that RL fine-tuning sharpens relative discrimination between examples at the cost of absolute calibration. We emphasize that this comparison is conducted under matched compute budgets; GRPO with substantially greater rollout budgets could potentially close or reverse this gap, consistent with the scaling properties documented by Agrawal et al. (2025).

MENTAT demonstrated improvements across both initialization strategies, achieving CCC of $0.42$ – $0.43$ and NMSE of $0.90$ – $0.95$ . Notably, instruction following is the one task where GEPA surpasses MENTAT in correlation (CCC $0.46$ vs $0.43$ ), though MENTAT retains a clear calibration advantage (NMSE $0.90$ vs $1.06$ ). We attribute GEPA’s stronger correlation to the nature of the task: instruction following scores exhibit high variance and limited systematic structure, favoring GEPA’s unconstrained evolutionary search over MENTAT’s fixed $3$ -iteration design. The ablation results show that neural aggregation improves calibration (NMSE) over prompt evolution alone across both initialization strategies, though GEPA’s stronger CCC suggests that MENTAT’s fixed-iteration prompt evolution may underexplore the prompt space on this task, limiting the quality of the representations passed to the aggregator.

Pairwise RAG Comparison Performance

On the pairwise RAG comparison task, fine-tuning NeoBERT achieved very low CCC scores while appearing competitive on the NMSE metric by “hacking” the distribution. Surprisingly, GPT-4.1 demonstrated superior performance compared to GPT-5 on this task, in sharp contrast with the general trend observed in mathematical error detection. Detailed prompting with GPT-4.1 achieved CCC scores of $0.47$ across both training configurations, while GPT-5 detailed prompting resulted in lower CCC scores of $0.31$ .

Unlike math errors, instruction following, and essay grading tasks, correct decisions on the pairwise RAG benchmark often hinge on a few salient cues and short justifications. With chain-of-thought scaffolds on this task, we observe that GPT-5 systematically “overthinks,” resulting in predictions that concentrate near the center of the $[-2,2]$ margin rather than faithfully spreading across the empirical label distribution. As shown in Figure 5, its variance is under-dispersed relative to ground truth, with more than half of examples yielding identical rollouts across three samples. Rollout correlations are very high, and the final numbers fall on a coarse grid (e.g., $\{-1,-\tfrac{1}{3},0,\tfrac{1}{3}\}$ ), all consistent with hedging.

By contrast, GPT-4.1 produces short, decisive judgments that remain closer to the dataset mean with greater spread and more frequent use of the extremes. Although GPT-4.1 rollouts are also correlated, the resulting distribution retains enough variance and calibration to yield substantially higher CCC. For pairwise RAG, GPT-5 tends toward the center and compresses its numeric range, degrading distributional fidelity (and thus CCC) even when NMSE remains similar.
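To make the distinction between the two metrics concrete, the sketch below computes CCC using Lin's standard concordance correlation coefficient and NMSE as mean squared error divided by the label variance. The NMSE normalization is our assumption, since this section does not restate the definition, and the toy labels are illustrative rather than taken from the benchmark.

```python
from statistics import fmean


def ccc(y_true, y_pred):
    """Lin's concordance correlation coefficient (population moments)."""
    mt, mp = fmean(y_true), fmean(y_pred)
    vt = fmean([(t - mt) ** 2 for t in y_true])
    vp = fmean([(p - mp) ** 2 for p in y_pred])
    cov = fmean([(t - mt) * (p - mp) for t, p in zip(y_true, y_pred)])
    denom = vt + vp + (mt - mp) ** 2
    return 2.0 * cov / denom if denom > 0 else 0.0


def nmse(y_true, y_pred):
    """MSE normalized by label variance (assumed normalization)."""
    mt = fmean(y_true)
    vt = fmean([(t - mt) ** 2 for t in y_true])
    mse = fmean([(t - p) ** 2 for t, p in zip(y_true, y_pred)])
    return mse / vt


labels = [-2.0, -1.0, 0.0, 1.0, 2.0]           # toy labels on the [-2, 2] scale
collapsed = [0.0] * 5                           # always predict the label mean
compressed = [-0.5, -0.25, 0.0, 0.25, 0.5]      # right ordering, squeezed range

collapsed_ccc, collapsed_nmse = ccc(labels, collapsed), nmse(labels, collapsed)
compressed_ccc, compressed_nmse = ccc(labels, compressed), nmse(labels, compressed)
```

Under these definitions, a predictor that always outputs the label mean attains an NMSE of exactly 1 with a CCC of 0, and predictions that are perfectly ordered but squeezed toward the center are still penalized by CCC through the variance terms in its denominator.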

We hypothesize that GPT-4.1’s superior performance on pairwise RAG comparison aligns with recent findings that large reasoning models often underperform on simpler tasks (Shojaee et al., 2025). These models initially find correct solutions but continue reasoning toward incorrect answers, suggesting that excessive sophisticated reasoning can sometimes be counterproductive. This hypothesis is supported by our observation that over half of GPT-5’s examples yield identical rollouts across three samples, with final scores clustering on a coarse grid rather than reflecting the task’s inherent variance.

RM-R1-Qwen-14B achieves an NMSE of 5.66 and CCC of 0.15, substantially underperforming even basic GPT-5 prompting (NMSE 2.25, CCC 0.35). Furthermore, its binary classification accuracy of 55.7% is only marginally above chance, indicating that the model struggles not only with regression magnitude but also with preference direction. We attribute this poor performance to a fundamental mismatch between RM-R1’s training objective, selecting a binary winner, and the RiR requirement of producing calibrated, fine-grained scores. We adapted RM-R1-Qwen-14B for regression by extracting the log-probabilities over the binary preference tokens and mapping them to continuous scores. However, these probabilities reflect the model’s confidence in a discrete choice rather than its assessment of preference magnitude. By the time RM-R1-Qwen-14B commits to generating its answer token, it has already reasoned through its Chain-of-Rubrics and reached a binary conclusion; the resulting probability distribution is highly concentrated (typically $>0.95$), providing little signal about whether the preferred response is “slightly better” versus “much better.” This result underscores that even reasoning-augmented reward models, despite their success on preference benchmarks like RewardBench, do not naturally generalize to reasoning-intensive regression settings.

Essay Grading Performance

Essay grading represented the least complex reasoning-intensive task, with NeoBERT achieving reasonable performance that improved substantially with additional training data. This aligns with the task’s characterization as requiring primarily semantic understanding rather than deep sequential reasoning. GPT-4.1 achieved strong baseline performance with detailed prompting (CCC: 0.65, NMSE: 0.73), while MENTAT provided meaningful improvements. In the 100-sample training regime, CCC improved by $7.7\%$ and NMSE improved by $26.0\%$ compared to detailed prompting. In the 500-sample training regime, CCC improved by $4.6\%$ and NMSE improved by $27.4\%$ . Notably, GPT-5 performance on essay grading showed surprisingly poor concordance compared to GPT-4.1, supporting the hypothesis from Section 4.5 that sophisticated reasoning models may over-deliberate on simpler tasks.

5. Conclusion

We investigated reasoning-intensive regression (RiR). Our empirical findings reveal a tension: prompting leverages LLMs’ reasoning capabilities but produces quantized, imprecise outputs, while supervised fine-tuning for regression can often collapse without learning the task. We proposed MENTAT, a simple method suggesting that hybrid approaches may help address this tension by combining iterative prompt optimization via batched error analysis with neural aggregation, achieving consistent improvements across several different RiR tasks.

However, our work opens several rich avenues for future research. The RiR framework we establish creates opportunities to more extensively evaluate sophisticated RL and prompt optimization techniques and develop RiR-adapted regression-aware fine-tuning methods (Lukasik et al., 2024; Chiang et al., 2025). Extending the benchmark beyond its current four tasks, particularly to domains such as clinical scoring, financial risk assessment, and code review, is a primary direction for future work, and would help clarify where the boundary between Level 2 and Level 3 regression lies empirically. Similarly, our open-source experiments currently use gpt-oss-20b, and validating MENTAT on even smaller open-source models would strengthen the case for on-premise deployment in regulated settings such as finance or healthcare, where proprietary API access may be restricted. Moreover, our focus on lightweight methods also motivates exploring the efficiency-performance trade-offs in reasoning-intensive tasks. While reinforcement learning methods like Group Relative Policy Optimization (Shao et al., 2024) require thousands of rollouts that exceed the lightweight compute budgets typical of ad-hoc RiR deployment, our benchmark provides a testbed for developing more efficient alternatives as RiR datasets scale. Similarly, MENTAT’s $3\times$ inference cost increase highlights the need for systematic cost-benefit analysis across deployment scenarios, opening questions about adaptive rollout strategies and inference-time optimization that our tasks can help address. Lastly, we acknowledge the use of light assistance from generative AI tools, with all outputs reviewed and edited by the authors, in the preparation of portions of this paper.

Acknowledgments

This work used Expanse GPU at the San Diego Supercomputer Center (SDSC) through allocation CIS250733 from the Advanced Cyberinfrastructure Coordination Ecosystem: Services & Support (ACCESS; Boerner et al. 2023) program, which is supported by U.S. National Science Foundation grants $\#2138259$ , $\#2138286$ , $\#2138307$ , $\#2137603$ , and $\#2138296$ . This research was partly supported by Laude Institute.

Appendix A Extended Related Work

This appendix presents additional related work beyond that covered in the main sections.

Ensemble Learning. Ensemble learning combines several individual models to obtain better performance (Ganaie et al., 2022). Classical methods include bagging, boosting, as well as stacking (Breiman, 1996a; Freund and Schapire, 1996; Wolpert, 1992; Breiman, 1996b). General methods include negative correlation learning, explicit and implicit ensembles, and homogeneous and heterogeneous ensembles (Liu and Yao, 1999; Srivastava et al., 2014; Breiman, 2001). More recent ensembling approaches for LLMs include LLM-Blender, which performs pairwise comparisons over the outputs of $N$ different LLMs to discern subtle differences, then merges the top $K$ ranked outputs (Jiang et al., 2023a). DeePEn (Huang et al., 2024) is an ensembling method in which probability distributions from individual LLMs are translated into a “relative representation” space (to bypass the vocabulary discrepancies), making aggregation possible. There are many recent works on fusion methods (Lv et al., 2024a, b; Mavromatis et al., 2024; Park et al., 2024; Verga et al., 2024). Wang et al. (2024b) propose a fusion-of-experts method which fuses outputs of multiple (expert) models with complementary knowledge of the data distribution and casts it as a supervised learning problem. Prompt ensembling has also had great success in improving task accuracy (Jiang et al., 2023b; Pitis et al., 2023; Allingham et al., 2023; Khalifa et al., 2023; Si et al., 2023; Arora et al., 2022; Li et al., 2023), as has the use of Recursive Feature Machines (RFMs) for feature learning and aggregation in the steering of LLMs (Beaglehole et al., 2025).

Routing. Routing determines, from a pool of available LLMs, which model is best suited to produce the most accurate and effective response to a given query. Recent work includes RouteLLM (Ong et al., 2024), a framework for query routing between “strong” and “weak” LLMs and Zooter (Lu et al., 2023), a reward-guided routing approach that distills rewards from training queries into a routing function, enabling precise allocation of each query to the LLM with the relevant expertise.

Mixture-of-Experts. Mixture-of-Experts (MoEs) is a framework in architecture design, in which multiple specialized sub-models (“experts”) handle different parts of the input space (Jacobs et al., 1991; Jordan and Jacobs, 1993; Shazeer et al., 2017). A gating mechanism then selects or weighs these experts to generate a combined output. Recent work has sought to extend MoEs to LLMs, where several MLP experts are added after each multi-head self-attention module in the Transformer encoder and decoder blocks (Fedus et al., 2022; Chowdhery et al., 2023; Shen et al., 2023; Csordás et al., 2024). MoE applications in LLMs have demonstrated the ability to increase model size without a proportional rise in computational complexity, largely due to MoEs’ inherently sparse computations (Chen et al., 2024). Recently, the mixture-of-agents (Wang et al., 2024c) architecture has been proposed, in which multiple LLMs are stacked into sequential layers. Each layer’s LLMs receive the responses from the previous layer for further refinement.

Natural Language Regression. The two common approaches to natural language regression with decoder-based LLMs are autoregressive regression (Vacareanu et al., 2024; Lukasik et al., 2024, 2025; Gruver et al., 2024; Liu and Low, 2023) and predictive heads (Zhuang et al., 2023; Fernandes et al., 2023). The former directly predicts the numerical target as text (e.g., predict $112$ by predicting the tokens ‘1’, ‘1’, and ‘2’). The latter approach learns a separate head on encoded inputs.

Currently, work on advancing regression tends to focus on non-reasoning, classical feature-based regression tasks. This includes OmniPred (Song et al., 2024), which introduces a framework for training language models as universal end-to-end regressors; the authors train a 200M parameter T5 encoder-decoder for the specific task of classical regression. Complementarily, Nguyen et al. (2024) introduce an “embed-then-regress” framework that leverages pre-trained language models’ string embedding capabilities to map arbitrary text inputs into fixed-dimensional vectors for downstream regression.

Fine-tuning large language models (LLMs) represents a potential approach for RiR, but recent work (Lukasik et al., 2024, 2025; Chiang et al., 2025) studying conventional regression problems, generally without any reasoning, demonstrates that decoder-only Transformers face fundamental optimization challenges for regression tasks due to the misalignment between cross-entropy loss (optimized for classification) and regression objectives. Their work introduces Regression-Aware Fine-Tuning (RAFT), but demonstrates only modest gains over encoder-only models like RoBERTa on conventional regression tasks, despite requiring extensive computational resources.

Other recent work has explored specific language-oriented regression tasks that involve reasoning, particularly for reward models (Mahan et al., 2024; Ankner et al., 2024). However, most such approaches rely on fine-tuning LLMs and extracting log-probabilities for special tokens at very large scale in terms of data and model size, since they tackle fairly general-purpose, one-time fitting of their models. In contrast, we are interested in particularly lightweight and data-efficient methods for adapting LLMs to arbitrary reasoning-intensive regression problems with limited resources.

Appendix B Numerical Output Quantization in Large Language Models

The quantization patterns observed in LLM predictions demonstrate systematic precision limitations across reasoning-intensive regression tasks. Analysis of the test set per model on the math errors task reveals that GPT-4.1 exhibits $63.1\%$ clustering at $.00/.50$ decimal endings, while GPT-5 shows $86.5\%$ clustering, compared to the approximately uniform distribution of ground truth labels. This quantization bias appears consistently across both mathematical error detection and pairwise RAG comparison tasks, though the latter’s more discrete rating scale ([-2, 2]) somewhat constrains the range of possible outputs. The observed clustering significantly deviates from uniform distribution expectations, indicating systematic rather than random quantization behavior.
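The clustering statistic above can be computed with a simple check on each prediction's fractional part. This is a minimal sketch of one way to measure it, not the authors' analysis script; the function name, the tolerance, and the toy predictions are illustrative.

```python
def half_step_fraction(values, tol=1e-9):
    """Fraction of values whose decimal ending is .00 or .50.

    A value ends in .00 or .50 exactly when twice the value is an integer.
    """
    hits = 0
    for v in values:
        doubled = 2.0 * v
        if abs(doubled - round(doubled)) < tol:
            hits += 1
    return hits / len(values)


# Toy predictions on a [0, 10] scale (illustrative, not benchmark data).
predictions = [7.5, 3.0, 6.25, 4.5, 2.13]
fraction = half_step_fraction(predictions)
```

Applying the same function to the ground-truth labels gives the reference rate against which a model's clustering can be compared.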

These findings highlight a fundamental challenge in direct LLM numerical prediction: while models can perform sophisticated reasoning about regression problems, their text-based output generation inherently discretizes continuous values into a coarse grid. This quantization directly undermines regression precision requirements, particularly for tasks demanding fine-grained numerical discrimination. The systematic nature of this bias across different model scales and tasks provides empirical justification for our neural aggregation approach, which leverages LLM reasoning capabilities while delegating precise numerical prediction to conventional regression architectures better suited for continuous output generation.

Appendix C Failure Modes of RiR Tasks

C.1. Mathematical Error Detection

We examined whether math-error regression performance degrades on long chain-of-thought solutions. As shown in Figure 8, absolute prediction error shows no strong dependence on solution length (average $\rho$ of $0.05$ ). Errors occur across all lengths, suggesting that performance is not primarily driven by surface-level verbosity.

We instead found through qualitative analysis of high-error cases that there is a distinct concentration in geometry and spatial-reasoning problems (e.g., grid-rectangle enumeration, line–region intersection). These tasks require constructing and manipulating an internal spatial representation, which current LLMs struggle with, leading to early divergence from the gold reasoning trace. This is in line with current findings on the difficulty LLMs face with respect to geometric reasoning (Mouselinos et al., 2024). We present two problems with very large prediction errors in Figure 9.

C.2. Pairwise RAG Comparison

We analyze length bias in pairwise RAG comparison scoring by measuring how predicted preference scores vary with the length gap between the system and reference responses ( $\Delta$ length = sys_len − ref_len). As shown in Figure 10, human annotation scores already exhibit a positive correlation with response length ( $\rho=0.332$ ), indicating that annotators systematically favor more verbose answers. The detailed (human-crafted) prompting baseline strongly amplifies this effect: its predicted scores correlate at $\rho=0.427$ with system length and $\rho=0.617$ with the length gap, producing an almost monotonic preference for longer system responses. This aligns with recent studies that have identified several biases that plague LLM judges, including position bias, verbosity bias, and self-enhancing bias (Zheng et al., 2023; Wang et al., 2024a). MENTAT mitigates but does not eliminate the effect, reducing the correlations to $\rho=0.375$ and $\rho=0.551$ , respectively, and thereby aligning more closely with the inherent human bias. Moreover, these results demonstrate that length bias is structurally embedded in the underlying preference data, and that prompt-only scoring tends to exacerbate this bias (along with the quantization issues seen in Appendix B), whereas a learned scoring head can partially correct for it without contradicting the human signal. Moving forward, we argue that the community needs a broader class of RiR benchmarks that explicitly minimize such confounds; otherwise, progress on tasks requiring calibrated, high-granularity numerical judgments will remain limited.

Appendix D Training the MLP

The MLP model was trained using PyTorch with the following configuration and hyper-parameters:

• Batch size: $32$.
• Number of epochs: $1000$.
• Optimizer: AdamW with learning rate of $0.0001$.
• Loss function: Weighted CCC and NMSE loss.
• One hidden layer with dimension $8$.
• Training procedure: Mini-batch gradient descent with shuffled batches.

The model was trained with early stopping based on validation loss, monitoring at 100-epoch intervals. We used the standard train/validation/test split ratios discussed in the experimental sections.

Moreover, during training, both training and validation losses were monitored to ensure proper convergence and avoid over-fitting. The model parameters corresponding to the best validation performance were saved and used for final evaluation on the test set. This standardized training procedure was used across all experiments, with the only variation being the input dimension size based on the specific task configuration.

D.1. NeoBERT

The implementation details (model parameters) for NeoBERT are as follows:

• No hidden layers; a simple linear regression head maps the $768$ -dimensional embedding directly to a single scalar.
• Optimizer: AdamW with default parameters.
• Loss function: Weighted CCC and NMSE loss (0.8 and 0.2, respectively).
• Batch size: $16$.
• Training epochs: $10$.

The implementation used standard PyTorch Dataset and DataLoader classes for batching and GPU acceleration when available. All model weights were initialized from the pre-trained NeoBERT-base checkpoint except for the regression head, which used default PyTorch initialization.

Appendix E CCC-Optimized RL Fine-Tuning

The per-item reward used in our main RL fine-tuning experiments (Section 4.3.2) optimizes a proxy: each rollout receives $r=1-|y_{\text{pred}}-y_{\text{true}}|$ , and GRPO advantages are normalized within groups for the same problem. However, this per-item reward is only loosely coupled to the evaluation metric (CCC), which measures population-level rank-order agreement and mean calibration across the full test set.

We explore an alternative formulation that directly optimizes a mini-batch estimate of CCC as the reward signal. For each training step, we sample a batch of $b$ problems with $k$ rollouts each. For a given rollout on problem $i$ , we construct a batch-level prediction vector by substituting that rollout’s prediction at position $i$ while using group means for all other positions $j\neq i$ . The CCC of this vector against the ground-truth vector serves as the rollout’s reward, shifted from $[-1,1]$ to $[0,1]$ . GRPO advantages are still computed within each problem’s group, but now reflect each rollout’s marginal contribution to batch-level correlation rather than pointwise accuracy alone.
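The substitution scheme described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the training code: the function names are hypothetical, CCC is computed with Lin's standard formula using population moments, and the shift to $[0,1]$ is taken to be $(\mathrm{CCC}+1)/2$.

```python
from statistics import fmean


def ccc(y_true, y_pred):
    """Lin's concordance correlation coefficient (population moments)."""
    mt, mp = fmean(y_true), fmean(y_pred)
    vt = fmean([(t - mt) ** 2 for t in y_true])
    vp = fmean([(p - mp) ** 2 for p in y_pred])
    cov = fmean([(t - mt) * (p - mp) for t, p in zip(y_true, y_pred)])
    denom = vt + vp + (mt - mp) ** 2
    return 2.0 * cov / denom if denom > 0 else 0.0


def ccc_rewards(y_true, rollouts):
    """Batch-level CCC reward for every rollout.

    rollouts[i] holds the k predictions for problem i. For each rollout, the
    batch prediction vector uses group means everywhere except position i,
    where the rollout's own prediction is substituted.
    """
    group_means = [fmean(group) for group in rollouts]
    rewards = []
    for i, group in enumerate(rollouts):
        row = []
        for pred in group:
            vec = list(group_means)
            vec[i] = pred                                # substitute this rollout
            row.append((ccc(y_true, vec) + 1.0) / 2.0)   # shift [-1, 1] to [0, 1]
        rewards.append(row)
    return rewards


# Toy batch: b = 3 problems, k = 2 rollouts each (illustrative values).
y_true = [0.0, 0.5, 1.0]
rollouts = [[0.0, 0.4], [0.5, 0.5], [1.0, 0.6]]
rewards = ccc_rewards(y_true, rollouts)
```

In the toy batch, the rollout closer to the true label receives the higher reward within its group; these rewards would then be centered and normalized within each problem's group in the usual GRPO manner.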

This formulation requires large batch sizes for stable CCC estimates, so we use $b=64$ with $k=4$ rollouts per problem, parallelized via asyncio. We train for $20$ steps with $3$ validation evaluations (steps $10$ , $15$ , $20$ ) on the full $500$ -example validation set, yielding a total optimization budget of $64\times 4\times 20+3\times 500=6{,}620$ LLM calls, comparable to MENTAT’s budget.

Across $3$ runs, CCC-optimized GRPO achieves a test CCC of $0.29\pm 0.01$ (and NMSE of $1.98\pm 0.03$ ), below the per-item reward variant (see Table 2). We attribute this gap to the high variance of CCC estimated over $64$ points: while the batch-level reward better aligns with the evaluation metric in principle, the resulting gradient signal is noisier than the per-item alternative, requiring either substantially larger batches or more training steps to converge, both of which would exceed the matched compute budget.

Appendix F Analysis of MENTAT

Figure 11 shows rollout variance distributions for detailed (human-crafted) versus MENTAT-evolved prompts across three tasks. MENTAT’s prompt evolution consistently reduces variance on reasoning-intensive tasks, achieving a 30% reduction in mean variance on Mathematical Error Detection. This demonstrates that evolved prompts produce more stable reasoning patterns rather than merely providing noisy signals for the aggregator to smooth. However, non-trivial variance remains after evolution, enabling the neural aggregator to extract meaningful signal from rollout diversity. These findings reveal MENTAT’s complementary design: prompt evolution improves prediction reliability while neural aggregation refines these consistent signals into precise numerical outputs.

Appendix G Example Task Entries

G.1. Mathematical Error Detection

G.2. Instruction Following

G.3. Pairwise RAG Comparison

G.4. Essay Grading

Appendix H Detailed (Human Crafted) Prompts

H.1. Mathematical Error Detection

Role:
  You are a fair evaluator. Analyze an incorrect mathematical solution and identify
  where the first error occurs in the solution process.

Inputs:
  - Math problem
  - Proposed (incorrect) solution

Task:
  Determine where the solution first goes wrong, and assign a regression label in
  [0.0, 10.0] based on the location of the first error:

  - 10.0: The solution is correct until the very end, and fails at the final step.
  - 0.0: The solution is wrong from the very beginning.
  - Intermediate scores indicate the fraction of the solution that is correct
    before the first error (e.g., 7.5 means the first 75%

Constraint:
  Do not output 10.0 or 0.0. The first error occurs within the proposed solution.

H.2. Instruction Following

Role:
  You are an expert evaluator. Predict the overall hmean score for a language model response.

Context:
  - The prediction text was generated by Llama-3.1-8B.
  - The overall mean scores were determined by Llama-3.1-70B.

Procedure:
  Analyze the response systematically by considering:
  1. The complexity and clarity of the task description.
  2. How well each decomposition point is addressed in the prediction text.
  3. The overall quality and completeness of the prediction text.
  4. Alignment between task requirements and the prediction.
  5. Coherence and relevance of the content.

Definition:
  The harmonic mean (hmean) represents how well the smaller model (Llama-3.1-8B)
  fulfilled the task requirements as judged by the larger model (Llama-3.1-70B).

Output:
  Provide reasoning step by step, then output a final score in [0.0, 1.0]:

  1.0: Perfect fulfillment of all task requirements.
  0.0: Complete failure to address the task.

Note:
  The dataset is heavily skewed toward 0.

H.3. Pairwise RAG Comparison

Role:
  You are a fair evaluator. Provide clear, objective feedback based on the criteria below.

Inputs:
  - Query
  - Reference answer
  - System-generated answer
  - Scoring rubric

Procedure:
  1. Compare the system response to the reference in terms of:
     - Helpfulness
     - Truthfulness
     - Completeness
  2. Identify specific strengths and weaknesses of the system response.
  3. Judge how well the system response addresses the query compared to the reference.

Output:
  Provide a final score as a real number in [-2.0, 2.0]:

   2.0: System response is more comprehensive and accurate than the reference.
   1.0: System response is slightly better than the reference overall.
   0.0: System response is about as good as the reference.
  -1.0: System response is slightly worse than the reference overall.
  -2.0: System response is noticeably worse than the reference overall.

H.4. Essay Grading

Role:
  You are an expert essay grader. Evaluate essay quality using the criteria below.

Inputs:
  - Essay
  - Demographic and linguistic features (supporting context)

Procedure:
  Analyze the essay systematically by considering:
  1. Grammar, vocabulary, and coherence of the text content.
  2. Structural elements (paragraphs, sentences, word count).
  3. Lexical diversity and vocabulary richness.
  4. Flow and connectivity of ideas throughout the essay.
  5. Overall organization.

Guidance:
  Use demographic and linguistic features only as supporting context. Focus primarily
  on cohesiveness, grammar, and vocabulary.

Output:
  Provide reasoning step by step, then assign a final score in [1.0, 5.0]:

  5.0: Highly cohesive with excellent flow, organization, vocabulary, and grammar.
  3.0: Moderately cohesive with some unclear transitions; moderate vocabulary and grammar.
  1.0: Very poor cohesiveness with little to no logical flow; poor vocabulary and grammar.

Appendix I LLM-Discovered Prompts

I.1. Mathematical Error Detection - GPT-5

Task:
  Identify the first objectively incorrect step in the provided solution and output a
  score in [0.0, 10.0] equal to the fraction of the solution that is correct before
  that error.

Procedure:
  1) Segment into atomic steps:
     - Count each meaningful action: definitions/variable naming used later, correct
       restatements that constrain the solution, substitutions, equation formations,
       identity applications, case/setup statements, computations, and logical inferences.
     - Do not count fluff or repetition that does not affect the derivation.

  2) Walk through in order and locate the first objectively incorrect item:
     - Examples: wrong modeling/equation from the problem text (e.g., misreading
       "k times more"), wrong operation/unit handling, incorrect enumeration in
       counting/probability, unjustified assumption when first used, invalid identity,
       or false deduction.
     - If a slip is immediately corrected and not used, do not treat it as the first error;
       otherwise, it is.

  3) Casework/branches:
     - Count correct setup and any correct early branches before the flawed branch that
       is pursued to the conclusion.
     - The first error is the earliest false statement in the pursued path.

  4) Determine the fraction:
     - Let T be the total number of counted steps.
     - Let k be the index (1-based) of the first error; correct steps before the error = k-1.
     - Fraction = (k-1)/T. If no error exists, Fraction = 1.0.

  5) Map to prediction:
     - Prediction = round(10 × Fraction, 2), bounded to [0.0, 10.0].
     - Use fine granularity; avoid anchoring to round numbers unless warranted by T.

Calibration:
  - Early foundational mistakes (modeling, first aggregation/enumeration) -> low scores (0-3).
  - Mid-solution errors (within computation/casework) -> mid scores (3-7).
  - Late slips after many valid steps (final simplification/identity) -> high scores (7-10).

Output:
  Output only the numeric prediction.
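
The arithmetic in steps 4 and 5 of this prompt can be written out directly. The sketch below is only an illustration of that mapping; the function name `first_error_score` and its argument names are hypothetical, and the step counts in the examples are invented.

```python
from typing import Optional


def first_error_score(total_steps: int, first_error_index: Optional[int]) -> float:
    """Map the 1-based index k of the first error among T counted steps to a score.

    Follows the prompt's rule: Fraction = (k - 1) / T, or 1.0 when no error exists,
    then Prediction = round(10 * Fraction, 2) bounded to [0.0, 10.0].
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    if first_error_index is None:
        fraction = 1.0
    else:
        if not 1 <= first_error_index <= total_steps:
            raise ValueError("first_error_index must lie in [1, total_steps]")
        fraction = (first_error_index - 1) / total_steps
    prediction = round(10 * fraction, 2)
    return min(10.0, max(0.0, prediction))


print(first_error_score(8, 7))     # six of eight steps correct before the error
print(first_error_score(7, 3))     # two of seven steps correct before the error
print(first_error_score(8, 8))     # error in the final step
print(first_error_score(5, None))  # no error found
```

Under this rule an error in the final step of an eight-step solution maps to 8.75 rather than 10.0, so the discovered prompt reaches 10.0 only when no error is found.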

I.2. Instruction Following (Detailed Prompt) - GPT-5

Task:
  Predict an hmean in [0.0, 1.0] for how well the 8B prediction fulfills the task as
  judged by the 70B model. Use a conservative, evidence-first rubric with a strict
  safety gate.

Procedure:

  1) Safety gate (evaluate the prediction text, not just the request)
     - Disallowed: sexual/erotic content (incl. erotic roleplay), sexual content involving
       minors, pornographic/incest content, harassment/toxicity, hate, self-harm or illicit
       harm, non-consensual harm/fetish violence, instructions to bypass ethics/safety.
     - Scoring:
       - Safe refusal only if the prediction explicitly includes both:
         (a) a brief apology (e.g., "Sorry") and (b) a clear "can’t help" (or equivalent).
         Optional safe alternative may follow. Quote these phrases. Score 0.8-1.0 depending
         on clarity.
       - If the prediction contains disallowed content, complies with an unsafe request,
         or lacks the explicit refusal components above -> 0.0-0.1.

  2) Verifiable checklist (no inference)
     - Extract all critical constraints: counts; schema/format (JSON/CSV/XML/exact literal);
       required keys; per-item word/character limits; per-sentence required phrases/suffixes/
       language; "only use provided context"; no extra commentary; role/persona/style persistence.
     - For each constraint: quote exact evidence from the prediction and perform explicit checks
       (itemized counts, lengths, uniqueness).

  3) Structural and exactness hard caps
     - Invalid/misspelled keys, wrong counts, per-item limits exceeded, or missing required
       per-sentence phrases -> cap < 0.2.
     - "Only use provided context": map each factual claim to quoted context; any unverifiable
       addition -> cap < 0.3.
     - Role/persona/style must be maintained throughout; any break or unverifiable continuity
       -> cap < 0.3.

  4) Exact-output tasks (tighten)
     - If the task requires an exact literal output (e.g., exactly "X"), any extra characters
       (quotes, wrappers, punctuation, prefix/suffix text) -> cap < 0.2, unless wrappers are
       explicitly allowed.

  5) Suffix/persona verification
     - For required per-sentence suffixes (e.g., ends with "nyan"), enumerate every sentence and
       verify the exact final token equals the required string (no trailing punctuation/characters).
       Any single miss -> cap < 0.2.

  6) Wait-for-input tasks
     - To score > 0.3, the prediction must do nothing except minimally request the needed input.
       Quote the minimal request. Any extra content -> cap < 0.3.

  7) Alignment, relevance, completeness
     - All required parts must be answered; any omission or off-topic drift -> cap < 0.3.
     - For subjective goals, credit only measurable proxies (e.g., explicit "twist" mention).
       If unverifiable, keep < 0.5.

  8) Coherence and quality
     - Among predictions that pass all critical constraints, adjust within the remaining range
       for clarity and polish.

  9) Conservative calibration (dataset skews low)
     - Start from 0.05. Add increments only for constraints verified with explicit quotes/counts.
     - Assign > 0.8 only if every critical constraint is explicitly satisfied with quoted evidence,
       no safety/format violations exist, and alignment is tight.
     - If any ambiguity remains (cannot verify a requirement), assume unmet and keep the score low
       (< 0.5).

  10) Output format
     - Provide brief step-by-step reasoning citing concrete quoted evidence for each verified
       constraint or violation (enumerate counts/limits when relevant).
     - Then output a single numeric score in [0.0, 1.0].

I.3. Instruction Following (Basic Prompt) - GPT-5

Task:
  Score the response by strictly checking task compliance with a safety-first,
  checklist-based rubric. Output only a single number in [0.0, 1.0].

Procedure:

  1) Extract constraints (mentally; do not output them)
     - Enumerate explicit requirements: content goals, safety/ethics, persona limits,
       "do-not" rules, exact words/labels, language, perspective, format/schema
       (including JSON validity), placeholders, counts (items/sentences/words),
       word/character ranges, brevity/minimal-output directives, tone/style, accuracy needs.

  2) Safety gate (highest priority; evidence-based)
     - Disallowed requests/content (e.g., sexual content involving minors, instructions
       facilitating wrongdoing/violence, explicit self-harm guidance, hateful conduct):
       - If the response cleanly refuses and (where suitable) offers a brief safe alternative,
         score 0.9–1.0.
       - If the response complies or meaningfully facilitates harm, score 0.0–0.2.
     - Only penalize safety when the actual response text contains the issue. Do not assume
       violations without evidence. Adult consensual creative content is not penalized unless
       it enters disallowed categories.

  3) Hallucination / grounding
     - If the task references provided/external text/data and the response fabricates having
       read or summarized it, cap < 0.1.
     - If info is missing and the response acknowledges or asks instead of inventing, reward
       accordingly.
     - Do not penalize invented details in explicitly fictional/creative tasks unless the
       prompt requires factual grounding or the response falsely claims to use specific sources.

  4) Hard constraints (strict; override others)
     - Any violation of non-negotiables ("do not...", exact labels, schema/JSON validity,
       "only output X", preserve placeholders, required language/perspective) -> cap < 0.3.
       Multiple such violations can warrant < 0.1.
     - Count/length/quantity requirements (items/sentences/words/characters): if out of range
       or wrong counts, cap < 0.5. Compute precisely (words = space-delimited tokens; ranges
       inclusive).
     - Minimal-output directives: any extra/unrequested text -> cap < 0.3.

  5) Accuracy and logic
     - Verify calculations, extractions, and factual consistency where checkable. Significant
       errors -> cap ≤ 0.4; minor slips -> cap < 0.7.
     - Correct accuracy cannot compensate for hard-constraint failures.

  6) Style / tone / format fidelity
     - Enforce required tone, persona, voice, casing, list/section structure, and language.
       Major misses -> cap < 0.6; minor deviations -> small deductions.

  7) Brevity and minimal outputs
     - Do not penalize correct minimal outputs (e.g., a single label/number). Penalize verbosity
       when brevity is required.

  8) Calibration
     - Use 1.0 only when all critical constraints are met with no safety/hallucination issues
       and only trivial nits remain.
     - Use 0.0 for clear harmful compliance, severe violations, or unusable responses.
     - Otherwise, scale by the fraction of satisfied constraints, weighting:
       Safety/Hard constraints > Accuracy > Format/Counts > Style.
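
The count rule and the cap language in step 4 can be made concrete with a small sketch. This is one possible reading, not the procedure the LLM judge actually follows: it treats each "cap < c" as an upper bound of c (ignoring the strict inequality), counts words by whitespace splitting, and uses hypothetical helper names (`word_count`, `within_range`, `apply_caps`) and invented example values.

```python
def word_count(text: str) -> int:
    """Count whitespace-delimited tokens, one reading of 'space-delimited tokens'."""
    return len(text.split())


def within_range(count: int, low: int, high: int) -> bool:
    """Inclusive range check, matching the prompt's 'ranges inclusive'."""
    return low <= count <= high


def apply_caps(score: float, caps: list) -> float:
    """Clip a provisional score by every triggered ceiling (assumed interpretation)."""
    return min([score, *caps])


# Invented example: the task asked for 3 to 5 words and the response has 6.
response = "one two three four five six"
provisional_score = 0.9
triggered_caps = []
if not within_range(word_count(response), 3, 5):
    triggered_caps.append(0.5)  # count/length violation -> cap < 0.5 in the prompt

final_score = apply_caps(provisional_score, triggered_caps)
print(final_score)
```

Because the caps combine through a minimum, the most severe triggered violation determines the ceiling, which matches the prompt's statement that hard constraints override the other criteria.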

I.4. Pairwise RAG Comparison - GPT-5

Scoring objective:
  Compare the system response to the reference answer along:
  1) truthfulness, 2) helpfulness, 3) completeness (in that order).
  Output a single score in [-2.0, 2.0]. Default to 0.0 unless clear evidence warrants
  moving the score.

Procedure:
  1) Identify the core question and the main claim(s) of the reference.
  2) Check whether the system’s main claim matches the reference’s correct conclusion(s).
     - If the system contradicts a correct reference on the main point, or introduces harmful
       misinformation: score -1.5 to -2.0.
     - If partially correct but misses an important constraint/nuance: score -0.33 to -1.0
       depending on impact.
  3) Assess truthfulness of added details.
     - Reward only accurate, non-contradictory specifics.
     - If added details may be incorrect or conflict with the reference, subtract rather than add.
  4) Assess helpfulness/actionability and clarity.
     - Prefer concrete, targeted, and directly useful content over vague or generic advice.
     - Do not reward verbosity by itself.
  5) Assess completeness relative to the question.
     - Credit coverage of key aspects the reference missed only if accurate and relevant.

Calibration guide (avoid extremes unless warranted):
  +2.0: Clearly more correct and more complete than the reference with no significant errors.
  +1.5: More helpful/complete, fully consistent and accurate; materially better.
  +1.0: Similar correctness but clearer/more actionable; or adds an accurate key detail.
  +0.33 to +0.67: Slightly better clarity or minor accurate additions.
  0.0: On par overall.
  -0.33 to -0.67: Slightly worse (minor inaccuracies, vagueness, or clarity issues).
  -1.0 to -1.5: Misses key point(s) or includes notable inaccuracies.
  -2.0: Clearly incorrect on the main claim, misleading, or unsafe.

Additional safeguards:
  - Prioritize truthfulness over added breadth; cap positive scores at +0.67 when added details
    are not corroborated by the reference or are only marginally relevant.
  - When both answers reach the same correct conclusion, stay near neutral; award modest positives
    only for clearly better clarity/actionability.
  - Use consistent, conservative scoring to reduce overuse of ±2.0.

I.5. Essay Grading - GPT-4.1

Task:
  Score essays holistically on a 1.0-5.0 scale, prioritizing idea development and
  organization. Use the steps and weights below.

Criteria (with weights):

  1) Purpose and task fulfillment (10%)
     - Identify the thesis/central claim.
     - Check whether the essay addresses the prompt and maintains focus.

  2) Development and support (40%)
     - Assess specificity, relevance, and sufficiency of reasons/examples.
     - Reward concrete details, explanations, and sustained elaboration.
     - Do not require formal citations; judge proportional to length.

  3) Organization and coherence (30%)
     - Look for: clear introduction; body paragraphs with topic sentences; logical sequencing;
       transitions; conclusion.
     - Reward multi-paragraph structure and logical flow even if language is non-native.

  4) Language use and style (15%)
     - Consider clarity, sentence variety, and appropriate word choice.
     - Reward effective phrasing; tolerate awkwardness if meaning is clear.

  5) Mechanics (5%)
     - Penalize only when errors impede comprehension or severely disrupt flow.
     - Do not over-penalize non-native grammar, spelling, or minor errors.

Guardrails:
  - Do not use length, grade level, or vocabulary sophistication as direct proxies for quality.
    Length matters only insofar as it enables development.
  - Redundancy/repetition reduces Development and Style modestly; do not let it dominate the score.
  - Use the full 1.0-5.0 range. Competent high-school argumentative/expository essays with a clear
    thesis, coherent paragraphs, and relevant support typically fall in 3.5-4.5 even with moderate
    grammar errors.

Scale anchors:
  - 5.0: Exceptional clarity and control; insightful development; seamless organization; errors, if any,
    are trivial.
  - 4.0: Clear thesis; coherent multi-paragraph structure; solid, relevant support with some specificity;
    minor lapses or noticeable but non-impeding errors.
  - 3.5: Adequate thesis and organization; generally relevant support with limited depth or uneven
    elaboration; errors present but meaning clear.
  - 3.0: Partially developed; some organization but weak/uneven support or coherence; frequent errors
    yet overall understandable.
  - 2.0: Limited development; weak organization; vague or generic support; errors sometimes impede flow.
  - 1.0: Minimal attempt; little to no coherence or development; errors often impede comprehension.

Calibration tips:
  - If an essay has a clear stance, at least three coherent body paragraphs with topic sentences, logical
    progression, and a conclusion, start at 3.8 and adjust ±0.5 for strength of support and clarity;
    do not drop below 3.0 unless coherence or comprehension breaks down.
  - Short but focused and coherent responses can score high if they present a clear thesis and
    well-connected support proportional to length.
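
The five weights in this prompt sum to 100%. As a rough sketch of how they could combine, the code below takes a weighted average of per-criterion scores on the 1.0-5.0 scale. The prompt itself asks for a holistic judgment, so the linear combination, the dictionary keys, the helper `weighted_essay_score`, and the example sub-scores are all assumptions made for illustration.

```python
# Weights as listed in the prompt's criteria.
WEIGHTS = {
    "purpose": 0.10,
    "development": 0.40,
    "organization": 0.30,
    "language": 0.15,
    "mechanics": 0.05,
}


def weighted_essay_score(sub_scores: dict) -> float:
    """Weighted average of per-criterion scores, clipped to the 1.0-5.0 scale."""
    total = sum(WEIGHTS[name] * sub_scores[name] for name in WEIGHTS)
    return min(5.0, max(1.0, total))


# Invented sub-scores for one essay.
example = {
    "purpose": 4.0,
    "development": 3.0,
    "organization": 4.0,
    "language": 3.5,
    "mechanics": 5.0,
}
print(round(weighted_essay_score(example), 3))
```

With these weights, a one-point change in Development moves the total eight times as much as a one-point change in Mechanics, which reflects the prompt's stated priority on idea development and organization.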

Appendix J Basic Prompts

J.1. Mathematical Error Detection

Task:
  Analyze the mathematical solution step by step and identify where the first error occurs.

Output:
  Output a single prediction in [0.0, 10.0] representing the fraction of the solution
  that is correct before the first error.

J.2. Instruction Following

Task:
  Analyze the task and the prediction to determine how well the model’s response
  fulfills the requirements.

Output:
  Output a single score in [0.0, 1.0] representing the overall quality and
  completeness of the response.

J.3. Pairwise RAG Comparison

Task:
  Analyze the system response compared to the reference answer step by step.

Criteria:
  Consider helpfulness, truthfulness, and completeness.

Output:
  Output a single score in [-2.0, 2.0] according to the rubric.

J.4. Essay Grading

Task:
  Analyze the essay systematically.

Criteria:
  Consider text content quality, structural elements, lexical diversity, and how well
  ideas flow and connect throughout.

Output:
  Assign a single score in [1.0, 5.0] (5.0 is best) based on overall quality.

Appendix K Error Analysis/Prompt Refinement Code

    import dspy


    class ErrorAnalysisOracle(dspy.Signature):
        """Conduct error analysis with access to optimization history for improved learning."""

        # Inputs: the prompt being refined, its observed errors, and earlier attempts.
        current_instructions: str = dspy.InputField(
            desc="Current guidance for the regression scoring model."
        )
        current_performance: str = dspy.InputField(
            desc="Performance analysis on examples with predictions vs ground truth."
        )
        optimization_history: str = dspy.InputField(
            desc="History of previous optimization attempts, their changes, and outcomes."
        )

        # Outputs: the error analysis first, then the refined prompt.
        per_mistake_analysis: str = dspy.OutputField(
            desc="For each significant error, analyze the pattern and hypothesize fixes. "
            "Incorporate lessons from the optimization history."
        )
        revised_instructions: str = dspy.OutputField(
            desc="Based on current analysis and optimization history, provide succinct "
            "updated instructions that avoid previous pitfalls."
        )

K.1. Error Analysis/Prompt Refinement Prompt

Task:
  Conduct targeted error analysis using current performance signals and prior optimization
  attempts. Identify recurring failure patterns and refine the scoring-model instructions
  while avoiding previous mistakes.

Inputs:
  - Current instructions: guidance currently used by the regression scoring model.
  - Current performance: analysis of predictions vs. ground truth; major errors.
  - Optimization history: what was tried before, what changed, what failed or improved.

Procedure:
  1) Analyze the inputs to identify recurring error patterns.
  2) Use the optimization history to avoid repeating previously ineffective changes.

Outputs:
  - Per-mistake analysis:
    For each major error, infer the underlying pattern and propose an instruction
    adjustment that would correct it, explicitly referencing lessons from earlier
    optimization rounds.
  - Revised instructions:
    Provide succinct updated instructions that avoid prior pitfalls.
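
The two text inputs named above, current performance and optimization history, have to be assembled from a batch of predictions before each refinement round. The following sketch shows one plausible way to build them using only the standard library. The helpers `summarize_errors` and `append_history`, the string formats, and the example values are hypothetical and are not taken from the MENTAT implementation.

```python
def summarize_errors(examples, top_k=3):
    """Format the top_k largest absolute errors as text for the analysis prompt.

    `examples` is a list of (item_id, prediction, label) tuples.
    """
    ranked = sorted(examples, key=lambda e: abs(e[1] - e[2]), reverse=True)
    lines = [
        f"{item_id}: predicted {pred:.2f}, label {label:.2f}, error {pred - label:+.2f}"
        for item_id, pred, label in ranked[:top_k]
    ]
    return "\n".join(lines)


def append_history(history, round_index, change, metric_name, metric_value):
    """Return a new history list with one line describing the latest round."""
    entry = f"round {round_index}: {change} -> {metric_name}={metric_value:.3f}"
    return history + [entry]


# Invented batch of (item_id, prediction, label) triples.
batch = [("a", 2.0, 7.5), ("b", 5.0, 5.5), ("c", 9.0, 4.0)]
current_performance = summarize_errors(batch, top_k=2)

history = []
history = append_history(history, 1, "added step-counting rule", "CCC", 0.412)
optimization_history = "\n".join(history)

print(current_performance)
print(optimization_history)
```

Sorting by absolute error keeps the prompt focused on the major errors that the refinement instructions ask the model to analyze, and returning a new history list on each call leaves earlier rounds unchanged.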
