Code-enabled language models can outperform reasoning models on diverse tasks

Cedegao E. Zhang; Cédric Colas; Gabriel Poesia; Joshua B. Tenenbaum; Jacob Andreas

arXiv:2510.20909·cs.CL·October 27, 2025

Open Access · 3 Reviews

TL;DR

This paper demonstrates that standard instruction-tuned language models, when combined with code execution and few-shot learning, can outperform specialized reasoning models across various tasks without additional fine-tuning.

Contribution

The authors introduce CodeAdapt, a simple method combining code interleaving with few-shot learning, enabling LMs to excel at reasoning tasks without finetuning.

Findings

01

CodeAdapt enables LMs to outperform reasoning models on average across multiple tasks.

02

It improves token efficiency by up to 81%.

03

Code reasoning traces show diverse and rich problem-solving strategies.

Abstract

Reasoning models (RMs), language models (LMs) trained with reinforcement learning to produce long-form natural language reasoning, have been remarkably successful, but they still require large amounts of computation and data to train, and can be slow and expensive to run. In this paper, we show that standard instruct LMs can already be elicited to be strong reasoners at a level comparable to or even surpassing their corresponding RMs (e.g., DeepSeek V3 vs R1) without finetuning, across diverse domains from instruction following and creative generation to mathematical reasoning. This is achieved by CodeAdapt, our simple recipe that combines the CodeAct framework, where LMs interleave natural language reasoning with code execution in a multi-step fashion, with few-shot bootstrap in-context learning from as few as five training problems. Analyzing four matched pairs of LMs and RMs, we find…
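The recipe described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `<code>` tags, the `FINAL:` marker, the function names, and the `toy_model` stand-in for a real LM call are all hypothetical, and the bootstrap step assumes that "bootstrap" means keeping the model's own traces on training problems when their answers match the reference answers.

```python
import contextlib
import io
import re
from typing import Callable, List, Optional, Sequence, Tuple

# Hypothetical reply format: code inside <code> tags, answer after "FINAL:".
CODE_RE = re.compile(r"<code>(.*?)</code>", re.DOTALL)
FINAL_RE = re.compile(r"FINAL:\s*(.+)")


def run_code(source: str, env: dict) -> str:
    """Execute source in a shared namespace; return captured stdout or the error."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(source, env)  # illustration only: no sandboxing
    except Exception as exc:
        return f"{type(exc).__name__}: {exc}"
    return buffer.getvalue().strip()


def solve(
    model: Callable[[str], str],
    problem: str,
    demos: Sequence[str] = (),
    max_steps: int = 5,
) -> Tuple[Optional[str], str]:
    """Alternate model turns and code execution until a final answer appears."""
    transcript = "".join(demos) + f"Problem: {problem}\n"
    env: dict = {}  # variables persist across steps of one problem
    for _ in range(max_steps):
        reply = model(transcript)
        transcript += reply + "\n"
        final = FINAL_RE.search(reply)
        if final:
            return final.group(1).strip(), transcript
        code = CODE_RE.search(reply)
        if code:
            transcript += f"Observation: {run_code(code.group(1), env)}\n"
    return None, transcript


def bootstrap_demos(
    model: Callable[[str], str],
    train_set: Sequence[Tuple[str, str]],
) -> List[str]:
    """Assumed bootstrap rule: keep traces whose final answer matches the reference."""
    demos = []
    for problem, gold in train_set:
        answer, trace = solve(model, problem)
        if answer == gold:
            demos.append(trace + "\n")
    return demos


def toy_model(transcript: str) -> str:
    """Stand-in for an LM: first writes code, then reports the observation."""
    current = transcript.rsplit("Problem: ", 1)[1]
    if "Observation:" not in current:
        expression = current.splitlines()[0]
        return f"I will compute this with code.\n<code>print({expression})</code>"
    observation = current.rsplit("Observation: ", 1)[1].strip()
    return f"FINAL: {observation}"


# Build demos from two training problems, then solve a new one with them in context.
demos = bootstrap_demos(toy_model, [("2 + 3", "5"), ("7 * 6", "42")])
answer, trace = solve(toy_model, "10 - 4", demos)
print(answer)
```

In a real setting `toy_model` would be replaced by a call to an instruct LM, and the code would run in an isolated sandbox rather than through `exec`. The sketch only shows the control flow: the transcript grows with each reasoning step and each execution result, and successful training traces are prepended as few-shot examples.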

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 2 · Confidence 4

Strengths

1. Enhancing the reasoning capabilities of models under low-resource conditions is a topic worthy of research.
2. The paper is easy to read.

Weaknesses

1. The contribution is limited: the work is a small incremental step beyond CodeAct.
2. The work does not compare with other training-free and training-based methods for enhancing LLM reasoning.
3. The experimental setup is limited to 30/32B and API-based LLMs. The lack of experiments with 7B models raises questions about whether the method's effectiveness is overly dependent on the inherent reasoning capabilities of the LLMs.
4. The dataset used in the paper is relatively small, which is insufficien

Reviewer 02 · Rating 6 · Confidence 5

Strengths

- The research addresses a timely and novel question, whether code-augmented LMs can compete with expensively trained RMs, offering a fresh perspective on resource-efficient AI reasoning.
- The proposed CodeAdapt framework is both effective and practical, achieving superior performance across multiple tasks with minimal data and computational overhead, as validated by extensive experiments.
- The paper is well-structured and clearly written, with comprehensive evaluations, ablation studies, and in

Weaknesses

While the study is thorough, future work could explore the scalability of CodeAdapt to a broader range of models and real-world applications to further strengthen its generalizability.

Reviewer 03 · Rating 2 · Confidence 4

Strengths

1. By adding few-shot bootstrap in-context learning on top of CodeAct, it enables self-exploration of reasoning trajectories, eliminating the need for expert demonstrations.
2. It achieves performance improvements across multiple domains.

Weaknesses

1. The main issue with this paper is the lack of originality. Using code as a form of reasoning has already been widely studied in previous work. This paper mainly adds few-shot in-context learning on top of CodeAct, which does not provide substantial new insights. Domain adaptation through in-context examples has also been extensively explored in prior research.
2. The paper lacks comparisons with other few-shot in-context learning or few-shot domain adaptation methods.

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Reinforcement Learning in Robotics