Code-Guided Reasoning for Small Language Models: Evaluating Executable MCQA Scaffolds

Prateek Biswas; Dhaval Patel; Vedant Khandelwal; Shuxin Lin; Amit Sheth

arXiv:2605.18827 · cs.IR · May 20, 2026

TL;DR

This paper introduces Code-Guided Reasoning (CGR), a standardized evaluation protocol and generated-program resource for measuring when executable reasoning scaffolds improve small language models' performance on multiple-choice QA (MCQA) tasks.

Contribution

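CGR standardizes six components (a normalized item interface, a direct solver prompt, a generator prompt, a Python scaffold, solver-call and extraction helpers, and a three-channel result record) and pairs them with a resource of generated programs, so that assisted and direct SLM accuracy on MCQA tasks can be compared under one protocol.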

Findings

01

On the observed non-zero-baseline partition, assisted inference reaches 66.21% macro accuracy versus 38.11% for direct answering, a difference of +28.10 percentage points.

02

Performance regressions are observed on Time-MQA.

03

Generated programs sometimes violate no-hard-coding instructions.

Abstract

Multiple-choice QA benchmarks usually evaluate small language models (SLMs) as direct answerers, but deployed language-model systems increasingly rely on external scaffolds such as tools, code, and repeated model calls. We introduce Code-Guided Reasoning (CGR), an evaluation protocol and generated-program resource for measuring when executable reasoning scaffolds improve SLM performance on MCQA tasks. CGR standardizes six components: a normalized item interface, a direct solver prompt, a generator prompt, a Python scaffold, solver-call and extraction helpers, and a three-channel result record. On 20,498 retained result rows from a locally prepared MCQA bundle and six metadata-registered solver models, the observed non-zero-baseline partition shows 66.21% macro assisted accuracy versus 38.11% direct accuracy, a +28.10 percentage-point difference with a pair-bootstrap interval of [20.32,…
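To make the reported comparison concrete, the following is a minimal sketch of a paired percentile bootstrap for the difference between assisted and direct accuracy. It is not the authors' code: the function name `paired_bootstrap_ci`, the choice of resampling unit (one pair of accuracies per evaluation cell), the resample count, and the example numbers are all assumptions made for illustration, and the paper's exact pair-bootstrap procedure may differ.

```python
import random
from statistics import mean


def paired_bootstrap_ci(direct, assisted, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for mean(assisted - direct).

    Each index is one paired unit (hypothetically, one model/dataset cell).
    Resampling the per-pair differences keeps each pair together.
    """
    if len(direct) != len(assisted) or not direct:
        raise ValueError("direct and assisted must be non-empty and equal length")
    rng = random.Random(seed)
    diffs = [a - d for d, a in zip(direct, assisted)]
    n = len(diffs)
    # Resample pairs with replacement and record the mean difference each time.
    boot = sorted(mean(rng.choices(diffs, k=n)) for _ in range(n_resamples))
    lo = boot[int((alpha / 2) * n_resamples)]
    hi = boot[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return mean(diffs), lo, hi


# Hypothetical per-cell accuracies in percent; NOT values from the paper.
direct = [30.0, 42.5, 38.0, 25.0, 51.0, 44.0]
assisted = [61.0, 70.5, 35.0, 58.0, 79.0, 66.0]

point, lo, hi = paired_bootstrap_ci(direct, assisted)
print(f"difference = {point:+.2f} pp, 95% interval = [{lo:.2f}, {hi:.2f}]")
```

Resampling whole pairs, rather than the two arms independently, preserves the correlation between direct and assisted results on the same unit, which is usually what makes a paired interval tighter than an unpaired one.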
