Consistency of AI-Generated Exercise Prescriptions: A Repeated Generation Study Using a Large Language Model
Kihyuk Lee

TL;DR
This study assesses the consistency of exercise prescriptions generated by a large language model across semantic, structural, and safety dimensions. Semantic similarity was high, but quantitative details varied, which underscores the need for further validation.
Contribution
It provides a comprehensive analysis of the intra-model consistency of LLM-generated exercise prescriptions across semantic, structural, and safety aspects.
Findings
Semantic similarity was high across scenarios (mean cosine similarity: 0.879-0.939).
Variability was observed in quantitative components like exercise intensity.
Safety expressions were consistently included, but their counts varied significantly.
Abstract
Background: Large language models (LLMs) have been explored as tools for generating personalized exercise prescriptions, yet the consistency of outputs under identical conditions remains insufficiently examined. Objective: This study evaluated the intra-model consistency of LLM-generated exercise prescriptions using a repeated generation design. Methods: Six clinical scenarios were used to generate exercise prescriptions using Gemini 2.5 Flash (20 outputs per scenario; total n = 120). Consistency was assessed across three dimensions: (1) semantic consistency using SBERT-based cosine similarity, (2) structural consistency based on the FITT principle using an AI-as-a-judge approach, and (3) safety expression consistency, including inclusion rates and sentence-level quantification. Results: Semantic similarity was high across scenarios (mean cosine similarity: 0.879-0.939), with greater…
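As an illustration of the semantic consistency measure described in the Methods, the sketch below computes a mean pairwise cosine similarity over the embeddings of repeated outputs for one scenario. It is a minimal sketch under stated assumptions, not the study's implementation: the abstract does not say which SBERT model was used or how similarities were aggregated, so averaging over all unordered pairs is an assumption, and the function names `cosine_similarity` and `mean_pairwise_cosine` are hypothetical. Embeddings are taken as plain lists of floats so the example needs only the standard library.

```python
from itertools import combinations
from math import sqrt
from statistics import mean


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def mean_pairwise_cosine(embeddings):
    """Average cosine similarity over all unordered pairs of embeddings.

    Assumption: each item is the sentence embedding of one generated
    prescription for the same scenario (hypothetical aggregation scheme).
    """
    if len(embeddings) < 2:
        raise ValueError("need at least two embeddings to form a pair")
    return mean(cosine_similarity(u, v) for u, v in combinations(embeddings, 2))


# Toy 2-D vectors standing in for real sentence embeddings.
toy_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
score = mean_pairwise_cosine(toy_embeddings)
```

Under this pairwise scheme, 20 outputs per scenario would give 190 unordered pairs per scenario. A value near 1 indicates that repeated generations are close in embedding space; it does not by itself show that quantitative details such as intensity agree.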
