Cross-Model Consistency of AI-Generated Exercise Prescriptions: A Repeated Generation Study Across Three Large Language Models

Kihyuk Lee

arXiv:2604.19598·cs.CL·April 24, 2026

TL;DR

This study measures how consistently three large language models reproduce exercise prescriptions across repeated generations. It finds significant differences in output behavior between models and concludes that model choice matters for clinical reliability.

Contribution

The paper provides a comparative analysis of repeated-generation consistency across three LLMs, highlighting fundamental behavioral differences and their implications for clinical deployment.

Findings

01 GPT-4.1 produced entirely unique outputs with stable semantic content.

02 Gemini 2.5 Flash repeated itself heavily: only 27.5% of its outputs were unique, the rest duplicating text it had already produced.

03 Safety expressions were uniformly high across models, limiting their usefulness as a differentiation metric.
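
The unique-output percentages in the findings can be illustrated with a minimal sketch. It assumes that "unique" means the number of distinct output texts divided by the total number of generations, compared by exact string match after trimming whitespace. The listing does not state the paper's exact definition, and the function name `unique_output_rate` is hypothetical.

```python
def unique_output_rate(outputs):
    """Share of distinct texts among repeated generations.

    Assumption: two outputs count as the same only if they match
    exactly after stripping leading and trailing whitespace.
    """
    if not outputs:
        raise ValueError("need at least one output")
    distinct = {text.strip() for text in outputs}
    return len(distinct) / len(outputs)


# Hypothetical example: four generations, two of them identical.
runs = [
    "Walk 30 minutes, 5 days per week.",
    "Walk 30 minutes, 5 days per week.",
    "Brisk walking for 30 minutes on 5 days each week.",
    "Cycle 20 minutes, 3 days per week.",
]
rate = unique_output_rate(runs)  # 3 distinct texts out of 4 generations
```

Under this definition, a rate of 1.0 means no generation was ever repeated word for word, while a low rate means many generations were textual copies of one another.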

Abstract

This study compared repeated generation consistency of exercise prescription outputs across three large language models (LLMs), specifically GPT-4.1, Claude Sonnet 4.6, and Gemini 2.5 Flash, under temperature=0 conditions. Each model generated prescriptions for six clinical scenarios 20 times, yielding 360 total outputs analyzed across four dimensions: semantic similarity, output reproducibility, FITT classification, and safety expression. Mean semantic similarity was highest for GPT-4.1 (0.955), followed by Gemini 2.5 Flash (0.950) and Claude Sonnet 4.6 (0.903), with significant inter-model differences confirmed (H = 458.41, p < .001). Critically, these scores reflected fundamentally different generative behaviors: GPT-4.1 produced entirely unique outputs (100%) with stable semantic content, while Gemini 2.5 Flash showed pronounced output repetition (27.5% unique outputs), indicating…
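
As a rough illustration of the analysis the abstract describes, the sketch below computes a mean pairwise similarity per model and an H statistic across models. Two things here are assumptions rather than details from the paper: the similarity function is a simple bag-of-words cosine standing in for whatever semantic embedding the author used, and the reported H is assumed to come from a Kruskal-Wallis test. All function names are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def cosine_bow(a, b):
    """Bag-of-words cosine similarity (a lexical stand-in, not semantic)."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0


def pairwise_similarities(outputs, sim=cosine_bow):
    """Similarity score for every unordered pair of outputs."""
    return [sim(a, b) for a, b in combinations(outputs, 2)]


def mean_pairwise_similarity(outputs, sim=cosine_bow):
    scores = pairwise_similarities(outputs, sim)
    if not scores:
        raise ValueError("need at least two outputs")
    return sum(scores) / len(scores)


def kruskal_wallis_h(*groups):
    """Kruskal-Wallis H with average ranks and the standard tie correction."""
    pooled = sorted(v for g in groups for v in g)
    n = len(pooled)
    ranks = {}
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j] == pooled[i]:
            j += 1
        ranks[pooled[i]] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        i = j
    rank_term = sum(sum(ranks[v] for v in g) ** 2 / len(g) for g in groups)
    h = 12 / (n * (n + 1)) * rank_term - 3 * (n + 1)
    tie_sizes = Counter(pooled).values()
    correction = 1 - sum(t ** 3 - t for t in tie_sizes) / (n ** 3 - n)
    return h / correction if correction else float("nan")
```

In use, one would pass each model's list of pairwise scores to `kruskal_wallis_h` as a separate group. With 20 generations per scenario there are 190 pairs per scenario and model; the abstract does not say whether the paper aggregates over all pairs in this way.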
