Code-Guided Reasoning for Small Language Models: Evaluating Executable MCQA Scaffolds
Prateek Biswas, Dhaval Patel, Vedant Khandelwal, Shuxin Lin, Amit Sheth

TL;DR
This paper introduces Code-Guided Reasoning (CGR), a standardized evaluation protocol and resource to measure how executable reasoning scaffolds improve small language models' performance on multiple-choice questions.
Contribution
CGR provides an evaluation protocol and a generated-program resource for assessing how executable reasoning scaffolds affect small language model accuracy on MCQA tasks.
Findings
On the observed non-zero-baseline partition, assisted inference reaches 66.21% macro accuracy versus 38.11% for direct answering, a difference of +28.10 percentage points.
Model performance regresses on Time-MQA.
Generated programs sometimes violate no-hard-coding instructions.
Abstract
Multiple-choice QA benchmarks usually evaluate small language models (SLMs) as direct answerers, but deployed language-model systems increasingly rely on external scaffolds such as tools, code, and repeated model calls. We introduce Code-Guided Reasoning (CGR), an evaluation protocol and generated-program resource for measuring when executable reasoning scaffolds improve SLM performance on MCQA tasks. CGR standardizes six components: a normalized item interface, a direct solver prompt, a generator prompt, a Python scaffold, solver-call and extraction helpers, and a three-channel result record. On 20,498 retained result rows from a locally prepared MCQA bundle and six metadata-registered solver models, the observed non-zero-baseline partition shows 66.21% macro assisted accuracy versus 38.11% direct accuracy, a +28.10 percentage-point difference with a pair-bootstrap interval of [20.32,…
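The headline comparison in the abstract (macro assisted accuracy minus macro direct accuracy, with a pair-bootstrap interval) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not the authors' implementation: the row layout `(group, direct_ok, assisted_ok)`, the choice to macro-average over a generic `group` key, the percentile interval, and the toy data. The point it shows is that each resampled row keeps its direct and assisted outcomes together, so the interval reflects the paired design.

```python
import random
from collections import defaultdict

def macro_accuracy(rows, channel):
    """Mean of per-group accuracies. Each row is (group, direct_ok, assisted_ok).
    The grouping key is hypothetical; the paper's macro unit may differ."""
    by_group = defaultdict(list)
    for group, direct_ok, assisted_ok in rows:
        by_group[group].append(assisted_ok if channel == "assisted" else direct_ok)
    return sum(sum(v) / len(v) for v in by_group.values()) / len(by_group)

def macro_diff(rows):
    """Assisted minus direct macro accuracy, as a fraction."""
    return macro_accuracy(rows, "assisted") - macro_accuracy(rows, "direct")

def paired_bootstrap_diff(rows, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap over paired rows; returns values in percentage points."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        # Resample whole rows so each item's two outcomes stay paired.
        sample = [rows[rng.randrange(len(rows))] for _ in rows]
        diffs.append(macro_diff(sample))
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_boot)]
    hi = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return 100 * macro_diff(rows), 100 * lo, 100 * hi

# Toy data only (not from the paper): 1 = correct, 0 = incorrect.
toy_rows = [
    ("A", 1, 1), ("A", 0, 1), ("A", 0, 1), ("A", 0, 0),
    ("B", 1, 1), ("B", 0, 1),
]
point, lo, hi = paired_bootstrap_diff(toy_rows)
print(f"difference: {point:+.2f} pp, interval: [{lo:.2f}, {hi:.2f}]")
```

With the figures reported in the abstract, the same subtraction gives 66.21 − 38.11 = 28.10 percentage points; the sketch only shows how such a difference and its interval could be computed from row-level results.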
