A Compute-Matched Re-Evaluation of TroVE on MATH

Tobias Sesterhenn; Ian Berlot-Attwell; Janis Zenkner; Christian Bartelt

arXiv:2507.22069·cs.PL·August 1, 2025

Open Access

TL;DR

This paper re-evaluates TroVE's effectiveness on the MATH benchmark, finding that its apparent benefits are mainly due to increased computational resources rather than its toolbox mechanism.

Contribution

The paper shows that, once TroVE's selection mechanism is corrected and compute is matched, TroVE's advantage over the baseline comes primarily from its larger compute budget, not from creating or reusing toolbox functions.

Findings

01

TroVE's performance gain over the baseline is largely explained by its larger compute budget.

02

Correcting TroVE's selection mechanism improves accuracy by 3%.

03

After compute matching, TroVE's advantage drops to 1%.

Abstract

Reusing established theorems and formulas is central to mathematical problem solving, serving as essential building blocks for tackling increasingly complex challenges. Recent work, TroVE, argues that code-generating Large Language Models (LLMs) can benefit similarly on the MATH benchmark by inducing and reusing higher-level toolboxes. By allocating computational budget across an ensemble of three modes -- directly generating code, creating tools, and reusing tools -- TroVE claims to outperform a PRIMITIVE baseline that only performs direct generation. However, recent analysis (Berlot-Attwell et al., 2024) casts doubt on these gains, noting that the tools created are often trivial or rarely reused, suggesting that improvements may stem from self-consistency or self-correction. In this work, we re-evaluate TroVE on MATH, analyze the impact of each of its modes, and show that its benefit…
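The compute-matching idea in the abstract can be illustrated with a toy simulation. The sketch below is not TroVE's code: `toy_sampler`, `evaluate`, the success probability, and the plain majority vote used for selection are all hypothetical stand-ins chosen for illustration. It compares a three-mode ensemble (a fixed number of samples per mode) against a single-mode baseline given either the same per-mode budget or the same total budget.

```python
import random
from collections import Counter


def majority_vote(answers):
    """Pick the most frequent answer (a self-consistency style stand-in,
    NOT TroVE's actual selection mechanism)."""
    return Counter(answers).most_common(1)[0][0]


def toy_sampler(rng, p_correct=0.4, truth=42):
    """Hypothetical model call: right with probability p_correct,
    otherwise one of a few scattered wrong answers."""
    if rng.random() < p_correct:
        return truth
    return rng.choice([7, 13, 99, 100, 101])


def evaluate(samplers, samples_per_mode, truth=42, trials=200, seed=0):
    """Accuracy of majority voting over all samples drawn from all modes."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(trials):
        answers = []
        for sampler in samplers:
            answers.extend(sampler(rng) for _ in range(samples_per_mode))
        correct += int(majority_vote(answers) == truth)
    return correct / trials


if __name__ == "__main__":
    k = 5  # assumed per-mode sample budget
    # Three modes that are all equally good: the "toolbox" adds nothing here.
    ensemble = evaluate([toy_sampler] * 3, samples_per_mode=k)
    # Baseline with one third of the ensemble's total compute.
    unmatched = evaluate([toy_sampler], samples_per_mode=k)
    # Baseline given the same total number of samples as the ensemble.
    matched = evaluate([toy_sampler], samples_per_mode=3 * k)
    print("ensemble, 3 modes x k samples:", ensemble)
    print("baseline, k samples:          ", unmatched)
    print("baseline, 3k samples:         ", matched)
```

Because all three modes share one sampler in this toy setup, any gap between the ensemble and the `k`-sample baseline can only come from the extra samples, and the `3k`-sample baseline reproduces the ensemble's accuracy exactly. A real comparison would of course use distinct prompting modes and the actual selection procedure.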

Taxonomy

Topics: Anomaly Detection Techniques and Applications