Cross-Model Consistency of AI-Generated Exercise Prescriptions: A Repeated Generation Study Across Three Large Language Models
Kihyuk Lee

TL;DR
This study evaluates how consistently three large language models reproduce exercise prescriptions when the same prompt is generated repeatedly at temperature=0. The models differ significantly in output behavior, which makes model choice important for clinical reliability.
Contribution
The paper provides a comparative analysis of repeated-generation consistency across three LLMs, highlighting fundamental differences in generative behavior and their implications for clinical deployment.
Findings
GPT-4.1 produced entirely unique outputs (100%) while keeping semantic content stable.
Gemini 2.5 Flash frequently duplicated text across runs: only 27.5% of its outputs were unique.
Safety expressions were uniformly high across models, which limits their usefulness as a metric for differentiating between models.
Abstract
This study compared the repeated-generation consistency of exercise prescription outputs across three large language models (LLMs), GPT-4.1, Claude Sonnet 4.6, and Gemini 2.5 Flash, with temperature set to 0. Each model generated prescriptions for six clinical scenarios 20 times, yielding 360 total outputs (3 models × 6 scenarios × 20 repetitions), which were analyzed across four dimensions: semantic similarity, output reproducibility, FITT classification, and safety expression. Mean semantic similarity was highest for GPT-4.1 (0.955), followed by Gemini 2.5 Flash (0.950) and Claude Sonnet 4.6 (0.903), with significant inter-model differences confirmed (H = 458.41, p < .001). Critically, these scores reflected fundamentally different generative behaviors: GPT-4.1 produced entirely unique outputs (100%) with stable semantic content, while Gemini 2.5 Flash showed pronounced output repetition (27.5% unique outputs), indicating…
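The abstract contrasts two measures: whether repeated outputs are textually unique, and how similar they are in meaning. The sketch below is a minimal illustration of that contrast, not the paper's actual pipeline. It assumes that the unique-output rate is the number of distinct strings divided by the number of generations, and it uses bag-of-words cosine similarity as a simple stand-in for whatever embedding-based similarity the study used. The function names and the sample prescriptions are hypothetical.

```python
from collections import Counter
from itertools import combinations
import math


def unique_output_rate(outputs):
    """Fraction of generations that are distinct exact strings (assumed definition)."""
    return len(set(outputs)) / len(outputs)


def cosine(a, b):
    """Cosine similarity between two token-count vectors stored as Counters."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mean_pairwise_similarity(outputs):
    """Mean cosine similarity over all unordered pairs of outputs.

    Bag-of-words vectors are a stand-in here; the study's similarity
    model is not specified in the abstract.
    """
    vectors = [Counter(text.lower().split()) for text in outputs]
    sims = [cosine(x, y) for x, y in combinations(vectors, 2)]
    return sum(sims) / len(sims)


# Study design from the abstract: 3 models x 6 scenarios x 20 repetitions.
TOTAL_OUTPUTS = 3 * 6 * 20  # 360

# Hypothetical repeated generations for one scenario.
runs = ["walk 30 min", "walk 30 min", "walk 30 min", "jog 20 min"]
print(unique_output_rate(runs))        # share of distinct strings
print(mean_pairwise_similarity(runs))  # average pairwise similarity
```

The two measures can diverge: a model may reword every response, giving a unique-output rate of 1.0, while its pairwise similarity stays high. This matches the behavior the abstract attributes to GPT-4.1.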
