O1 Replication Journey -- Part 3: Inference-time Scaling for Medical Reasoning

Zhongzhen Huang; Gui Geng; Shengyi Hua; Zhen Huang; Haoyang Zou; Shaoting Zhang; Pengfei Liu; Xiaofan Zhang

arXiv:2501.06458 · cs.CL · January 14, 2025

TL;DR

This paper investigates how increasing inference time in large language models enhances medical reasoning performance, demonstrating improved diagnostic accuracy and reasoning complexity through extensive experiments on medical benchmarks.

Contribution

It applies inference-time scaling to medical reasoning in LLMs and shows that it improves both task performance and reasoning quality.

Findings

01 Inference-time scaling improves performance by 6%-11%.

02 Longer reasoning chains are necessary for complex tasks.

03 Generated diagnoses follow the hypothetico-deductive method.

Abstract

Building upon our previous investigations of O1 replication (Part 1: Journey Learning [Qin et al., 2024] and Part 2: Distillation [Huang et al., 2024]), this work explores the potential of inference-time scaling in large language models (LLMs) for medical reasoning tasks, ranging from diagnostic decision-making to treatment planning. Through extensive experiments on medical benchmarks of varying complexity (MedQA, Medbullets, and JAMA Clinical Challenges), our investigation reveals several key insights: (1) Increasing inference time does lead to improved performance. With a modest training set of 500 samples, our model yields substantial performance improvements of 6%-11%. (2) Task complexity directly correlates with the required length of reasoning chains, confirming the necessity of extended thought processes for challenging problems. (3) The differential diagnoses generated by our…

Code & Models

Repositories

spiral-med/ophiuchus (official)

Datasets

SII-SPIRAL-MED/o1-journey-Ophiuchus
dataset · 3 downloads

Taxonomy

Methods: Sparse Evolutionary Training