Revisiting the evidence on caffeine mouth rinse: effects on exercise and cognitive performance: a meta-analytic review
Hengzhi Deng, Xiaohan Fan, Tianyu Song, Nasnoor Juzaily bin Mohd Nasiruddin, Abdullah Al-Hadi Ahmad Fuaad, Mohamed Nashrudin bin Naharudin

TL;DR
Caffeine mouth rinse may offer small exercise benefits, especially for aerobic endurance, with optimal results from short rinses and moderate caffeine exposure.
Contribution
This study provides an updated meta-analysis on caffeine mouth rinse effects, identifying context-specific benefits and optimal application parameters.
Findings
Caffeine mouth rinse shows trivial-to-small improvements in aerobic endurance performance.
Short (~5 s) rinses outperform longer durations, with higher exposure potentially reducing effectiveness.
Cognitive effects are inconsistent, but processing speed shows more sensitivity than accuracy.
Abstract
Caffeine mouth rinsing (Caff-MR) may activate oropharyngeal receptors and rapidly engage central networks for motivation, attention, and pacing without systemic absorption. The only prior meta-analysis found no stable ergogenic effect, yet the evidence base has continued to expand and remains heterogeneous. Six electronic databases were searched up to 2 October 2025 for Caff-MR studies on exercise and cognitive outcomes. Study quality was assessed using modified PEDro and RoB-2. Three-level meta-analyses synthesized both outcomes. Prespecified moderators were sex, training status, habitual caffeine use, feeding state, exercise or cognitive type, rinse duration, and total oral exposure. Sensitivity analyses addressed assumed within-subject correlations, outliers, and influential cases. Thirty-one studies (k = 167 effects) met inclusion. Caff-MR was associated with trivial-to-small…
Table 1. Characteristics of the included exercise performance studies.

| Study | Exercise protocol | Sample; training status | Mean age (y) | Dosage of CAF and PLA | Washout period | Performance outcomes | Statistical significance |
|---|---|---|---|---|---|---|---|
| Barbosa et al. [ | 800 m running | 7 males; | 24.6 ± 11.5; | CAF: 300 mg | ≥1 week | Time (s): CAF: 189.6 ± 30.4 vs PLA: 185.6 ± 30.3 | Performance: |
| Beaven et al. [ | 5 × 6 s cycling sprints with 24 s rest between each | 12 males; | 32 ± 7.5; | CAF: 300 × 5 mg + saccharin | ≥48 h | Peak power compared to PLA (W): Sprint 1-5: 21.43 ± 34.28, 13.13 ± 28.12, −2.42 ± 25.95, 8.30 ± 33.75, −12.86 ± 32.14 | Performance: |
| Boat et al. [ | Self-control exertion (S10) or non-self-control exertion (N10) 10 km cycling time trial test | 15 males, | 22.4 ± 2.56; | CAF: 200 mg | ≥48 h | Overall time of S10 and N10 (s): CAF: 990 ± 89.08 and 986 ± 89.08 vs PLA: 996 ± 89.08 and 989 ± 92.95 | Performance: |
| Bottoms et al. [ | 30 minutes self-pacing cycling | 12 males; | 20.5 ± 0.7; | CAF: 40 mg | 1 week | Distance (km): CAF: 16.2 ± 2.8 vs PLA: 14.9 ± 2.6 | Performance: |
| Clarke et al. [ | 1 RM bench press; | 15 males; | 21 ± 2; | CAF: 300 mg + 200 mg sucralose | ≥48 h | 1 RM (kg): CAF: 87.26 ± 17.74 vs PLA: 86.16 ± 16.83 | Performance: |
| Doering et al. [ | Task equal to 60 minutes cycling at 75% peak power | 10 males; | 32.9 ± 7.5; | CAF: 280 mg + de-carbonated, non-caffeinated diet cola | 1 week | Time (s): CAF: 3918 ± 243 vs PLA: 3940 ± 227 | Performance: |
| Dolan et al. [ | Yo-Yo intermittent recovery test-Level 1 | 10 males; | 19.9 ± 1.3; | CAF: 300 mg + sucralose and nonsugar, cherry flavoured sweetener | 1 week | Distance (m): CAF: 1342 ± 320 vs PLA: 1397 ± 360 | Performance: |
| Farmani et al. [ | Throwing medicine ball; | 18 males; | 21.86 ± 2.40; | CAF: Coffee, but total exposure is about (240 + 60) mg | 1 week | Throwing distance difference between CAF and PLA (m): 0.24 ± 0.83 | Performance: |
| Figueiredo et al. [ | 10 km running; | 10 (8 males and 2 females); | 30.1 ± 6.4; | CAF: 300 mg | 1 week | Times (s): CAF: 47.45 ± 6.34 vs PLA: 47.07 ± 5.18 | Performance: |
| Gough et al. [ | Repeated sprint ability test | 9 males; | 21 ± 3; | CAF: 400 mg + sucralose | ≥96 h | Pre- and post-mean power (W): CAF: 217.37 ± 18.16 and 218.16 ± 25.79 vs PLA: 212.89 ± 27.37 and 222.37 ± 25.79 | Performance: |
| Karayiğit et al. [ | 30 s Wingate test | 10 males; | 20.50 ± 1.58; | CAF: 3000 mg + sodium saccharin | 3–5 days | Mean power (W): CAF: 641.57 ± 77.53 vs PLA: 648.31 ± 74.16 | Performance: |
| Karayigit, Ali et al. [ | 1 RM squat and bench press; | 27 (13 males, 14 females); | Males: 24 ± 3; | CAF: 11000 mg + sucralose | 48–96 h | Male and female 1 RM bench press (kg): CAF: 102.35 ± 11.85 and 66.30 ± 5.3 vs PLA: 104.37 ± 11.85 and 65.80 ± 4.28 | Performance: |
| Karayigit, Koz et al. [ | 1 RM bench press; | 14 males; | 23 ± 2; | CAF: (250 + 2 × 3 × 250/(500 + 2 × 3 × 500)/(750 + 2 × 3 × 750) mg + sucralose | 48 h | 1 RM bench press (kg): 1%, 2%, 3% CAF: 96.31 ± 5.28, 97.39 ± 5.60, 97.49 ± 6.58 vs PLA: 97.92 ± 5.61 | Performance: |
| Karuk et al. [ | 2 × 30 s vertical jump, with 5 minutes rest between each | 8 males; | 22.3 ± 4.2; | CAF: 300 mg | ≥48 h | Maximal, mean, minimal jump height change from the first to the second test under CAF relative to PLA: −0.80 ± 4.73, −4.6 ± 8.0, −7.0 ± 12.9 | Performance: |
| Kizzi et al. [ | 5 × 6 s cycling sprints, with 24 s rest between each | 8 males; | 23 ± 2; | CAF: 3000 mg | 1 week | Peak power (W): CAF: 643 ± 79 vs PLA: 573 ± 79 | Performance: |
| Marinho et al. [ | 30 s Wingate test | 10 males; | 24.8 ± 3.7; | CAF: 300 mg + calorie-free mint flavour | ≥48 h | Peak power (W/kg): CAF: 15.05 ± 0.68 vs PLA: 14.99 ± 0.74 | Performance: |
| Marinho et al. [ | 30 minutes constant load cycling to fatigue and then 10 km cycling time trial test | 10 males; | 24.7 ± 3.6; | CAF: 1200 mg + cellulose capsules | 3–7 days | Time (s): CAF: 1363 ± 345 vs PLA: 1321 ± 320 | Performance: |
| Melo et al. [ | 80% respiratory compensation point cycling until task failure | 12 males; | 22.0 ± 2.8; | CAF: 300 × (?) mg + non-caloric mint essence | 72–96 h | Exhaustion time (minutes): CAF: 91 ± 22 vs PLA: 76 ± 19 | Performance: |
| Miraftabi et al. [ | 20 m sprint; | 13 males; | 18.1 ± 0.9; | CAF: 6000 mg + non-caloric mint essence | 1 week | Fasted and fed sprint time (s): CAF: 3.3 ± 0.5 and 3.45 ± 0.2 vs PLA: 3.4 ± 0.1 and 3.5 ± 0.1 | Performance: |
| Nabuco et al. (2021) | 75% peak power cycling to failure | 10 males; | 32 ± 3; | CAF: (?) × 85 mg | 1 week | Time (s): CAF: 2004 ± 767 vs PLA: 1688 ± 618 | Performance: |
| Pak et al. (2020) | 6 × Taekwondo anaerobic intermittent kick tests | 27 (18 males, 9 females); | Male: 18 ± 4; | CAF: 6 mg/kg + artificial saccharine | N/A | Pre-, during (4 weeks) and post- Ramadan successful kicks (times): CAF: 40.4 ± 6.9, 36.6 ± 6.05, 39.0 ± 7.5, 39.3 ± 6.8, 40.0 ± 6.6 and 40.4 ± 6.1 vs PLA: 40.1 ± 6.9, 33.9 ± 5.8, 36.3 ± 7.0, 37.8 ± 6.7, 38.9 ± 6.6 and 40.1 ± 6.4 | Yes |
| Pataky et al. (2016) | 3 km cycling time trial | 38 (25 males, 13 females); | 21 ± 1; | CAF: 600 mg + saccharine | 3–7 days | Mean power output percent difference between CAF and PLA (W): −0.17 ± 8.32 | Performance: |
| Şahin et al. (2024) | 10 × 6 s repeated cycling sprints | 16 males; | 21.6 ± 3.39; | CAF: 300 mg + non-caloric sweetener | ≥48 h | Relative peak power and mean power (W/kg): CAF: 7.53 ± 0.88 and 5.90 ± 0.72 vs PLA: 7.45 ± 0.84 and 5.81 ± 0.70 | Performance: |
| Sinclair & Bottoms (2015) | 30 minutes arm crank time trial | 12 males; | 21.54 ± 1.28; | CAF: 32 mg | N/A | Distance (km): CAF: 15.43 ± 3.27 vs PLA: 13.15 ± 3.36 | Performance: |
| Taheri Karami et al. (2023) | Futsal intermittent endurance test; | 24 males; | 19.09 ± 1.57; | CAF: Coffee, but total exposure is about (300 + 60)/(625 + 125) mg | 1 week | Distance of endurance test (m): 0.24% and 0.5% CAF: 1494.4 ± 220 and 1677.7 ± 206.1 vs PLA: 1439.8 ± 236.3 | Performance: |
| Tallis et al. (2024) | Countermovement jump; | 27 males; | 20 ± 2; | CAF: 3 mg/kg + 20 mL water + 30 mL sugar-free orange drink | 3–5 days | Countermovement jump height (cm): CAF: 32.3 ± 6.9 vs PLA: 31.3 ± 5.6 | Performance: |

Table 2. Characteristics of the included cognitive performance studies.

| Study | Exercise protocol | Sample; training status | Mean age (y) | Dosage of CAF and PLA | Washout period | Performance outcomes | Statistical significance |
|---|---|---|---|---|---|---|---|
| Balcı et al. [ | Victoria Stroop test (Part D and W for reaction time, Part C for response inhibition) | 30 males; | 22.7 ± 3.3; | CAF: 60/150/300 mg + sugar-free orange flavour | 1 week | Pre- and post- Part D reaction time (s): Male: 0.24% CAF: 58.95 ± 19.36 and 58.23 ± 15.91 vs 0.6% CAF: 63.53 ± 21.22 and 59.15 ± 19.00 vs 1.2% CAF: 62.93 ± 19.07 and 57.01 ± 16.74 vs PLA: 61.95 ± 19.57 and 61.58 ± 15.93 | Performance: |
| De Pauw et al. [ | Stroop task | 10 males; | 27 ± 3; | CAF: 300 mg + sodium salt of saccharin | N/A | Pre- and post- of congruent reaction time (ms): CAF: 630.3 ± 119.9 and 605.1 ± 85.2 vs PLA: 599.8 ± 56.5 and 597.8 ± 66.7 | Performance: |
| Karayigit, Ali, et al. [ | Modified arrow flanker task | 27 (13 males, 14 females); | Males: 24 ± 3; | CAF: 4000 mg + sucralose | 48-96 h | Pre- and post- congruent accuracy (%): Male: CAF: 95.60 ± 1.8 and 96.78 ± 2.3 vs PLA: 96.86 ± 1.9 and 96.72 ± 2.4; Female: CAF: 96.37 ± 1.2 and 95.49 ± 1.9 vs PLA: 95.90 ± 1.5 and 96.13 ± 2.2 | Performance: |
| Pomportes et al. [ | Duration-production task and Simon task during 40 minutes 60% peak power cycling | 24 (16 males, 8 females); | Males: 24 ± 6; | CAF: 201 mg + orange sugar-free syrup | ≥72 h | Overall variance of duration production task (ms): CAF: 180.7 ± 39.98 vs PLA: 195.0 ± 60.80 | Performance: |
| Şahin et al. [ | Kick reaction time; | 16 males; | 21.6 ± 3.39; | CAF: 300 mg + non-caloric sweetener | ≥48 h | Kicking reaction time of set 1 and 2 (ms): CAF: 475.1 ± 76.5 and 422.0 ± 67.4 vs PLA: 435.4 ± 63.4 and 429.6 ± 55.5 | Performance: |
| Toktaş et al. [ | Stroop colour-word test; | 65 (24 males and 41 females); | Male: 29.91 ± 12.06; | CAF: Coffee, but total exposure is about 32.5 mg | ≥1 week | Pre- and post- Stroop test reaction time of male (s): CAF: 24.92 ± 14.20 and 21.00 ± 6.02 vs PLA: 23.00 ± 7.83 and 21.92 ± 5.39 | Performance: |
| Virdinli et al. [ | Hand reaction time test; | 45 males; | 18 ± 3; | CAF: 300/450/600 mg | ≥3 days | Hand reaction time (ms): 1.2% CAF: 393.57 ± 52.14 vs 1.8% CAF: 411.43 ± 54.28 vs 2.4% CAF: 362.86 ± 35.71 vs PLA: 462.86 ± 82.85 | Performance: |
Introduction
Caffeine is among the most widely used psychoactive substances worldwide and has been extensively studied as an ergogenic aid [1,2]. Traditional oral ingestion of caffeine has been shown to improve endurance, strength, and power performance, largely through its central nervous system effects mediated by adenosine receptor antagonism [3]. However, ingestion can be accompanied by gastrointestinal discomfort, delayed absorption, and variable individual responses due to genetic and metabolic differences, such as CYP1A2 polymorphisms [4]. These limitations have prompted interest in caffeine mouth rinsing (Caff-MR), in which a solution is swilled briefly and expectorated to elicit rapid central effects without systemic uptake [5].
Existing neuroimaging studies have found that simply rinsing with caffeine without ingesting it may activate the insula, orbitofrontal cortex, and striatum, thereby enhancing central drive and cognitive control [6]. Consistent with these findings, several trials have reported improvements in repeated-sprint performance, self-paced cycling, or muscular endurance following Caff-MR [7–9]. However, other studies have failed to demonstrate benefits for time-trials, maximal strength, or running performance [5,10,11], leaving the contexts in which this strategy is effective uncertain.
Previous systematic reviews and meta-analyses have attempted to synthesise these findings, yet their conclusions remain unclear. A recent meta-analysis that included 16 studies reported a very small and nonsignificant effect of Caff-MR on exercise outcomes, emphasising the large variability in supplementation protocols and participant characteristics [12]. Another review that examined 18 studies (15 physical and 3 cognitive) observed consistent improvements in cognitive performance but found the evidence for physical outcomes to be inconsistent, with benefits more likely when rinsing was repeated during exercise or performed in a fasted state [13]. Similarly, a separate review of 11 randomised crossover trials identified only three studies that demonstrated clear ergogenic effects, whereas the majority showed no meaningful improvements [14]. Collectively, these earlier works highlighted substantial heterogeneity across trials in terms of caffeine concentration, rinse frequency, training status, and habitual caffeine intake. Importantly, although cognitive outcomes have been reported in individual studies, no meta-analysis has yet quantitatively synthesised these effects, despite their potential relevance for sports that rely on reaction time, accuracy, and decision-making.
Despite these uncertainties, interest in Caff-MR remains high, and the evidence base has expanded considerably in recent years. Some recent investigations have reported meaningful improvements in both physical [15–17] and cognitive outcomes [18,19], while others continue to find limited or context-dependent effects [20,21]. This growing but conflicting body of evidence underscores the need for an updated and more comprehensive synthesis.
Therefore, this systematic review and three-level meta-analysis aims to provide a precise and up-to-date quantification of the effects of Caff-MR on exercise and cognitive performance compared with placebo. In addition, this work seeks to identify the conditions under which these effects are most pronounced by examining key moderators such as exercise or cognitive task type, caffeine dose, sex, training status, and habitual caffeine use. By clarifying the magnitude, consistency, and contextual relevance of this intervention, the present study addresses ongoing uncertainty in the field and offers meaningful guidance for researchers, practitioners, and athletes who are considering non-ingestive strategies to enhance performance while avoiding gastrointestinal discomfort associated with caffeine ingestion.
Methods
This systematic review and meta-analysis was pre-registered on the Open Science Framework (OSF) on September 25, 2025 (Registration: osf.io/8r24c) and conducted in accordance with the PRISMA 2020 guidelines [22]. The completed PRISMA checklist is available in Electronic Supplementary Material Appendix S1.
2.1. Eligibility criteria
This review follows the PICOs framework: (1) Participants: healthy adults; (2) Intervention: Caff-MR; (3) Comparison: non-caffeinated placebo mouth rinse; (4) Outcomes: exercise and/or cognitive performance. Specifically, studies were included if they met all of the following criteria: (1) randomised controlled trials (RCTs) published in peer-reviewed journals; (2) investigated the effects of Caff-MR on exercise and/or cognitive performance outcomes; (3) involved healthy human participants; (4) included a non-caffeine control condition such as water or non-caloric placebo rinse; and (5) reported original experimental data in English.
Exclusion criteria were: (1) interventions involving ingestion rather than mouth rinsing; (2) co-interventions with other active ingredients (e.g. carbohydrate, menthol); (3) studies without exercise or cognitive performance-related outcomes; (4) reviews, abstracts, or non-original reports; or (5) insufficient methodological information.
2.2. Data sources and search strategy
A systematic search was conducted on October 2, 2025, across PubMed, Web of Science, Cochrane Library, Embase, SciELO, and SPORTDiscus. Two separate Boolean strategies were applied:
- (1) "caffeine mouth rinse" OR "caffeine oral rinse" OR "caffeine mouthwash" AND "exercise" OR "performance" OR "endurance" OR "strength" OR "resistance" OR "aerobic" OR "cycling" OR "running" OR "time trial" OR "time-to-exhaustion".
- (2) "caffeine mouth rinse" OR "caffeine oral rinse" OR "caffeine mouthwash" AND "cognition" OR "cognitive performance" OR attention OR memory OR "reaction time" OR "executive function" OR "mental fatigue".

No date or filter restrictions were applied.
2.3. Data extraction
All records were imported into Excel and EndNote 21 for de-duplication. Two independent reviewers (D.H.Z. and S.T.Y.) screened titles, abstracts, and full texts. Discrepancies were resolved by consensus. Extracted data included sample size, sex, training status, feeding condition, rinse solution and duration, and outcome measures (e.g. time to exhaustion, power output, reaction time, accuracy).
When data were not reported numerically, authors were contacted or WebPlotDigitizer (v4.8) was used for extraction [23].
2.4. Quality and risk of bias assessment
Methodological quality was assessed using a modified version of the Physiotherapy Evidence Database (PEDro) scale, with an additional item evaluating whether the study assessed the effectiveness of blinding to the placebo condition [24]. The total score ranged from 0 to 11, with studies categorised as excellent (10–11), good (7–9), fair (5–6), or poor (<5). Two reviewers (D.H.Z. and S.T.Y.) independently evaluated all included studies; any discrepancies were resolved through discussion, or adjudication by a third reviewer (M.N.N.) if consensus could not be reached.
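The quality bands above amount to a simple lookup on the total score. A minimal sketch, assuming an integer total between 0 and 11; the function name `pedro_category` is hypothetical and not taken from the study:

```python
def pedro_category(score: int) -> str:
    """Map a modified PEDro total (0-11) to the quality bands stated in the text."""
    if not 0 <= score <= 11:
        raise ValueError("modified PEDro total must be between 0 and 11")
    if score >= 10:
        return "excellent"  # 10-11
    if score >= 7:
        return "good"       # 7-9
    if score >= 5:
        return "fair"       # 5-6
    return "poor"           # below 5


for total in (11, 8, 5, 3):
    print(total, pedro_category(total))
```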
In parallel, risk of bias was evaluated using the Cochrane Risk of Bias 2 (RoB 2) tool. For crossover trials, we applied the RoB 2 version specifically adapted for crossover designs, which includes an additional domain (Domain S) assessing bias arising from period effects and carryover effects. Accordingly, the following domains were evaluated: randomisation process, deviations from intended interventions, missing outcome data, outcome measurement, selection of the reported result, and period and carryover effects [25]. Assessments were performed independently by the same two reviewers, with disagreements resolved in the same manner.
2.5. Statistical analysis
2.5.1. Effect size calculation and data synthesis
All effect size calculations adhered to the Cochrane Handbook for Systematic Reviews of Interventions (Version 6.5, 2024) [26]. Given that the majority of included studies had relatively small sample sizes, Hedges’ g was selected to correct for small-sample bias when estimating standardised mean differences (SMDs) between Caff-MR and placebo conditions [27].
Because all included studies employed within-subject crossover designs, additional consideration was required for the paired-sample structure [27]. Specifically, the correlation (r) between paired measurements must be incorporated to accurately estimate standard errors and avoid inflation or underestimation. For studies reporting both pre- and post-exercise values, SMDs were derived from the mean change scores between the two conditions, and the same assumed r value was applied to the pre–post comparisons to maintain consistency and comparability across studies. For those reporting only post-intervention values, differences between Caff-MR and placebo means were used directly while maintaining the within-subject dependency.
As most trials did not report r values, a correlation of r = 0.50 was assumed for the primary analysis [28], with sensitivity analyses conducted at r = 0.20 (lower bound) and r = 0.80 (upper bound) to test robustness [29].
Effect sizes (g) were interpreted using standard thresholds: trivial (<0.2), small (0.2–0.5), medium (0.5–0.8), and large (>0.8) [30]. Detailed computational formulas and step-by-step procedures are provided in Appendix S2.
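To illustrate how the assumed correlation enters a paired effect size, the sketch below uses one widely used textbook formulation for matched designs (difference-score SD rescaled to the within-condition SD, followed by the small-sample correction). It is an assumption that this matches the formulas in Appendix S2, which are not reproduced here; the function name `paired_hedges_g` is hypothetical. The input values are the Bottoms et al. distance data from the exercise study table (n = 12).

```python
import math


def paired_hedges_g(mean_caf, sd_caf, mean_pla, sd_pla, n, r=0.5):
    """Hedges' g and its sampling variance for a crossover (paired) comparison.

    r is the assumed correlation between the two conditions.
    """
    # SD of the paired differences implied by the assumed correlation
    sd_diff = math.sqrt(sd_caf**2 + sd_pla**2 - 2 * r * sd_caf * sd_pla)
    # Rescale to the within-condition SD so d is on the usual SMD metric
    sd_within = sd_diff / math.sqrt(2 * (1 - r))
    d = (mean_caf - mean_pla) / sd_within
    var_d = (1 / n + d**2 / (2 * n)) * 2 * (1 - r)
    # Small-sample correction factor J with df = n - 1
    j = 1 - 3 / (4 * (n - 1) - 1)
    return j * d, j**2 * var_d


# Primary assumption (r = 0.5) and the two sensitivity bounds (0.2, 0.8)
for r in (0.2, 0.5, 0.8):
    g, var_g = paired_hedges_g(16.2, 2.8, 14.9, 2.6, n=12, r=r)
    print(f"r = {r}: g = {g:.2f}, SE = {math.sqrt(var_g):.2f}")
```

Under this formulation the point estimate changes little with r, while the standard error shrinks as r increases, which is why the choice of r mainly affects the weight a study receives.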
2.5.2. Three-level meta-analysis and heterogeneity
To account for multiple outcomes nested within studies, a three-level meta-analysis was performed using the metafor package in R, with restricted maximum likelihood estimation (REML) [31–33]. Variance was partitioned into sampling error (level 1), within-study variance (level 2), and between-study variance (level 3) [34]. Model estimates were cross-validated using maximum likelihood (ML).
Heterogeneity was assessed using I² statistics, categorised as low (0–25%), moderate (25–50%), substantial (50–75%), or considerable (>75%) [28,35]. Prediction intervals (PIE) were calculated to provide context on expected effect distributions in future studies [36,37]. Power analyses were conducted using the metameta package to assess the risk of Type II error [38].
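The variance partition described above can be sketched in a few lines once the two variance components have been estimated. The example assumes the common approach of comparing the estimated components against a "typical" sampling variance (the Higgins and Thompson formula); whether the authors used exactly this computation is not stated. All input numbers are hypothetical, and the handling of values that fall exactly on a category boundary is an assumption, because the published ranges share their endpoints.

```python
def typical_sampling_variance(variances):
    """'Typical' sampling variance across effects (Higgins-Thompson formula)."""
    k = len(variances)
    w = [1 / v for v in variances]
    return (k - 1) * sum(w) / (sum(w) ** 2 - sum(x**2 for x in w))


def variance_shares(variances, tau2_within, tau2_between):
    """Share of total variance at level 1 (sampling), level 2 and level 3."""
    v_typ = typical_sampling_variance(variances)
    total = v_typ + tau2_within + tau2_between
    return {
        "sampling": v_typ / total,
        "within_study": tau2_within / total,
        "between_study": tau2_between / total,
    }


def i2_label(i2_percent):
    """Heterogeneity label; boundary handling is an assumption."""
    if i2_percent < 25:
        return "low"
    if i2_percent < 50:
        return "moderate"
    if i2_percent <= 75:
        return "substantial"
    return "considerable"


# Hypothetical sampling variances and variance components
shares = variance_shares([0.04, 0.04, 0.04, 0.04], tau2_within=0.01, tau2_between=0.03)
i2_total = 100 * (1 - shares["sampling"])
print(shares, i2_label(i2_total))
```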
2.5.3. Moderators and subgroup analysis
To explore between-study heterogeneity and derive more detailed conclusions, meta-regressions and moderator analyses were conducted for both exercise and cognitive outcomes.
For exercise performance, the following moderators were examined:
1) Participant sex.
2) Training status: Participants were classified as untrained or trained according to established participant categorisation frameworks [39]. The trained group included recreationally active, trained/developmental, well-trained/national-level, and elite/international-level athletes, while all others were categorised as untrained.
3) Habitual caffeine intake: Quantitative estimates of daily caffeine intake were extracted using the U.S. Department of Agriculture FoodData Central database reference values. Based on previous standards [40], participants were grouped as low (0–150 mg/day), moderate (150–300 mg/day), high (>300 mg/day) or unclear (habitual intake not reported or insufficient information to classify).
4) Pre-exercise nutritional status: Trials were coded into three categories according to the pre-test interval from the last meal: fed (≤4 h), fasted (>4 h), and unspecified (studies stating “maintain their regular/habitual diet” or similar without specifying an interval). When the reported range straddled the 4-h boundary (e.g. 3–5 h), studies were coded as unspecified to minimise misclassification.
The four-hour threshold for defining the postprandial (fed) state was selected based on previous experimental protocols [41,42] and is commonly used in exercise nutrition research. While the composition of the prior meal can influence gastric emptying and subsequent metabolic responses, a four-hour interval is widely considered sufficiently long for most nutrients to clear the primary absorptive phase [43]. Therefore, using this cutoff provides a practical and physiologically meaningful distinction between fed and fasted states, thereby improving comparability across studies.
5) Exercise task type: To better characterise the ergogenic effects of Caff-MR and improve model stability, outcomes were grouped into four physiological domains according to the predominant energy system or neuromuscular mechanism [29,44]:
(i) Strength/Power: Maximal efforts ≤10 s, primarily relying on the ATP–phosphocreatine system (e.g. vertical jump, medicine-ball throw, short sprint, kicking performance);
(ii) Anaerobic Performance: Efforts lasting >10 s and ≤60 s, predominantly glycolytic (e.g. Wingate test, repeated sprint ability, Taekwondo anaerobic intermittent kick test);
(iii) Muscular Endurance: Sustained submaximal contractions performed to volitional exhaustion or failure, typically involving repeated or continuous efforts targeting a specific muscle group, and generally lasting from ~30 seconds to several minutes depending on protocol (e.g. 2 × 30 s repeated jump, bench press to failure);
(iv) Aerobic Endurance: Continuous efforts exceeding ~2–3 minutes (typically > 10 minutes) where oxidative metabolism provides the predominant energy supply, even when significant anaerobic contributions occur in early phases (e.g. Yo-Yo intermittent recovery test, time-to-exhaustion cycling).
6) Caffeine total exposure dose: Following Nabuco et al. (2023) [12], total oral exposure to caffeine during mouth rinsing was computed as:
Total caffeine exposure (mg) = caffeine concentration (mg/mL) × rinse volume (mL) × number of rinses
This composite index reflects the overall caffeine mass theoretically available to stimulate oral receptors by integrating concentration (stimulus intensity), rinse volume (surface contact), and rinse frequency (stimulation repetitions), while minimising collinearity among these interrelated parameters.
To further address whether rinse frequency may exert effects beyond those captured by total caffeine exposure, we additionally examined the number of rinses as an independent moderator in exploratory sensitivity analyses, using both continuous and categorical specifications (single: 1; moderate: 2–9; high: ≥10 rinses).
7) Rinse duration: Rinsing time was analysed as a categorical moderator (5 s, 10 s, 15 s and 30 s), as it may modulate receptor activation independently of total caffeine exposure dose.
8) Interaction term (total dose × duration): To examine whether the ergogenic response depends on both the magnitude and duration of oral exposure, an exploratory multiplicative term between total dose and rinse duration (dose × duration) was tested in the meta-regression model.
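Several of the moderator coding rules above are mechanical and can be written down directly. The following is a minimal sketch under stated assumptions: the function names are hypothetical, a habitual intake of exactly 150 or 300 mg/day is assigned to the lower group (the published ranges share their endpoints), and total exposure is taken as the product of concentration, rinse volume and number of rinses, with a 1% w/v solution equal to 10 mg/mL.

```python
from typing import Optional, Tuple


def habitual_caffeine_group(mg_per_day: Optional[float]) -> str:
    if mg_per_day is None:
        return "unclear"
    if mg_per_day <= 150:  # assumption: exactly 150 mg/day counts as low
        return "low"
    if mg_per_day <= 300:  # assumption: exactly 300 mg/day counts as moderate
        return "moderate"
    return "high"


def feeding_state(hours_since_meal: Optional[Tuple[float, float]]) -> str:
    """hours_since_meal is the reported (min, max) interval, or None if not given."""
    if hours_since_meal is None:
        return "unspecified"
    low, high = hours_since_meal
    if high <= 4:
        return "fed"
    if low > 4:
        return "fasted"
    return "unspecified"  # range straddles the 4-h boundary, e.g. 3-5 h


def total_exposure_mg(concentration_pct: float, volume_ml: float, n_rinses: int) -> float:
    mg_per_ml = concentration_pct * 10  # 1% w/v = 10 mg/mL
    return mg_per_ml * volume_ml * n_rinses


def rinse_frequency_group(n_rinses: int) -> str:
    if n_rinses < 1:
        raise ValueError("at least one rinse is required")
    if n_rinses == 1:
        return "single"
    return "moderate" if n_rinses <= 9 else "high"


# A single 25 mL rinse at 1.2% gives about 300 mg; five such rinses give about 1500 mg
print(total_exposure_mg(1.2, 25, 1), total_exposure_mg(1.2, 25, 5))
print(feeding_state((3, 5)), feeding_state((2, 2)), feeding_state((8, 10)))
```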
Given the limited number of trials assessing cognitive performance (k = 7) and the heterogeneity of their outcome measures, we classified all outcomes into two measurement-based domains to enhance statistical comparability and reduce heterogeneity:
- (i) Speed-Based Performance: This domain included outcomes measuring response or completion time (e.g. reaction times in hand/foot/kick tests, mirror-tracing time, Stroop/Simon task reaction times). For these measures, lower values indicate better performance. Effect sizes were therefore sign-adjusted so that a positive pooled effect consistently represents improved performance.
- (ii) Accuracy-Based Performance: This domain included outcomes measuring response accuracy or error rate (e.g. accuracy percentage, error counts in Stroop/Simon tasks, mirror-tracing errors). For these measures, higher accuracy or fewer errors indicate better performance. Their effect direction already aligned with this convention and required no adjustment.
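The sign adjustment for speed-based outcomes can be illustrated with a short sketch. The function name `oriented_difference` and the `lower_is_better` flag are hypothetical; the example values are taken from the cognitive study table (hand reaction time for 2.4% CAF vs PLA in Virdinli et al., and post-test congruent accuracy for males in Karayigit, Ali, et al.).

```python
def oriented_difference(mean_caf: float, mean_pla: float, lower_is_better: bool) -> float:
    """Caffeine minus placebo, flipped when lower values mean better performance,
    so that a positive result always favours caffeine."""
    raw = mean_caf - mean_pla
    return -raw if lower_is_better else raw


# Reaction time (ms): faster under caffeine, so the oriented difference is positive
print(oriented_difference(362.86, 462.86, lower_is_better=True))
# Accuracy (%): higher is already better, so no flip is applied
print(oriented_difference(96.78, 96.72, lower_is_better=False))
```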
Other potential moderators (e.g. caffeine dose, rinse duration, participant sex, blinding quality) could not be examined quantitatively due to insufficient data.
Mixed-effects multilevel meta-regression models were fitted using restricted maximum likelihood (REML) estimation in the metafor package (rma.mv function), with individual effect sizes nested within studies and t-distribution–based inference. For continuous exploratory moderators, including total caffeine exposure dose, the dose × rinse duration interaction and rinse frequency, both linear and nonlinear (quadratic, cubic) specifications were examined. Model parsimony was determined by the corrected Akaike information criterion (AICc), and higher-order terms were retained only when they substantially improved model fit and were supported by sufficient data coverage [45,46].
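The AICc comparison between linear and higher-order specifications can be sketched as follows. The log-likelihoods and parameter counts are hypothetical placeholders, not values from the study, and the sketch assumes they come from fits whose likelihoods are comparable across different fixed-effect structures (typically maximum likelihood rather than REML).

```python
def aicc(log_lik: float, n_params: int, n_obs: int) -> float:
    """Small-sample corrected Akaike information criterion."""
    aic = -2 * log_lik + 2 * n_params
    return aic + (2 * n_params * (n_params + 1)) / (n_obs - n_params - 1)


# Hypothetical candidate models for one continuous moderator (k = 114 effects)
candidates = {
    "linear": {"log_lik": -40.0, "n_params": 4},
    "quadratic": {"log_lik": -39.6, "n_params": 5},
}
scores = {
    name: aicc(m["log_lik"], m["n_params"], n_obs=114)
    for name, m in candidates.items()
}
best = min(scores, key=scores.get)
print(scores, "preferred:", best)
```

In this made-up case the quadratic term improves the log-likelihood only slightly, so the extra-parameter penalty leaves the linear model with the lower AICc.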
All visualisations were generated using ggplot2 and orchaRd packages [47].
2.5.4. Publication bias and sensitivity analyses
Contour-enhanced funnel plots [48] and Egger’s regression tests [49] (when k ≥ 10) were used to assess publication bias [50].
Sensitivity analyses included: 1) varying the assumed correlation coefficient (r) to a relatively low value (r = 0.2) and a relatively high value (r = 0.8); 2) leave-one-out analyses; 3) exclusion of outliers identified via Cook’s distance and studentized residuals [51,52]; 4) exclusion of all single-blind studies to evaluate the effect of blinding on outcomes; and 5) exclusion of studies that used rinse volumes other than 25 mL (if applicable), because deviations from this commonly used volume may alter oral contact area, perceived stimulus intensity, and receptor activation, potentially introducing additional heterogeneity. Among these, outlier exclusion was applied not only to the overall models but also across all moderator and meta-regression analyses to ensure robustness of subgroup inferences, whereas the remaining sensitivity checks were conducted for the main pooled effects.
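The idea behind the leave-one-out check can be shown with a deliberately simplified pooling rule. The sketch below uses a plain inverse-variance (fixed-effect) average rather than the three-level model used in the study, and the effect sizes and variances are hypothetical.

```python
def pooled_estimate(effects, variances):
    """Inverse-variance weighted mean (simplified stand-in for the multilevel model)."""
    weights = [1 / v for v in variances]
    return sum(w * g for w, g in zip(weights, effects)) / sum(weights)


def leave_one_out(effects, variances):
    """Pooled estimate recomputed with each effect removed in turn."""
    return [
        pooled_estimate(effects[:i] + effects[i + 1:], variances[:i] + variances[i + 1:])
        for i in range(len(effects))
    ]


# Hypothetical effect sizes; the third one is much larger than the others
effects = [0.1, 0.2, 0.9]
variances = [0.04, 0.04, 0.04]
print(pooled_estimate(effects, variances))
print(leave_one_out(effects, variances))
```

A large shift in the pooled value when a single effect is dropped flags that effect as influential.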
2.6. Certainty of the evidence
The certainty of evidence was evaluated using the Grading of Recommendations Assessment, Development, and Evaluation (GRADE) framework, considering risk of bias, inconsistency, indirectness, imprecision, and publication bias [53]. Ratings were categorised as high, moderate, low, or very low. All GRADE assessments were performed independently by one reviewer and verified by a second. Any disagreements were resolved through discussion until consensus was reached.
Results
3.1. Studies retrieved
The initial search yielded 182 publications: 174 from the primary database search and 8 from other sources. After screening, a total of 31 studies met the inclusion criteria. These studies provided 167 effect size estimates (k = 167), of which 26 studies (k = 114) examined exercise performance and 7 studies (k = 53) examined cognitive performance (Figure 1). Two of these studies assessed both exercise and cognitive outcomes [19,54].
Figure 1. PRISMA flow diagram for included and excluded studies.
3.2. Characteristics of included studies
3.2.1. Characteristics of exercise performance studies
Across all studies, a total of 384 participants were included (346 males, 38 females), with sample sizes ranging from 7 to 38. The majority of exercise performance studies recruited only males (n = 22, k = 83), four studies recruited mixed-sex samples (k = 31), and no study recruited exclusively female participants. For training level, 14 studies (k = 57) included trained participants with performance levels ranging from amateur to elite, while 12 studies (k = 57) examined untrained individuals. Furthermore, most studies did not report participants’ habitual caffeine intake (n = 14, k = 50). Among those that did, the majority primarily included low-caffeine consumers (n = 8, k = 55), whereas four studies involved participants with moderate caffeine intake (k = 8). Notably, only one study recruited a mixed sample that included high caffeine consumers (k = 1) [5].
Rinse duration ranged from 5 to 30 seconds. Most studies used 5 seconds (n = 8, k = 41) or 10 seconds (n = 15, k = 54), whereas a single study used 15 seconds (k = 6) and another used 30 seconds (k = 11). Notably, one study included two distinct rinse durations [5]. Almost all studies used 25 mL of Caff-MR, except for two that employed 50 mL [20,55]. The most common caffeine solution concentration was 1.2%, and the number of rinses varied from 1 to 22 depending on the experimental protocol.
According to the type of exercise categorised by the primary underlying energy system or neuromuscular mechanism, 14 studies investigated aerobic endurance (k = 25), 8 examined anaerobic performance (k = 28), 14 assessed strength/power (k = 32), and 5 evaluated muscular endurance (k = 29). Regarding pre-exercise nutritional status, 13 studies (k = 44) were conducted under fed conditions (≤4 hours after the last meal), 7 studies (k = 50) under fasted conditions (>4 hours), and 6 studies (k = 20) did not specify dietary status or stated it as habitual. Notably, one study directly compared the fed and fasted conditions [16].
For more details, please refer to Table 1.
3.2.2. Characteristics of cognitive performance studies
Across all studies, a total of 217 participants were included (154 males and 63 females), with individual sample sizes ranging from 10 to 65. Among the seven cognitive studies, four recruited only male participants (k = 2), while three involved mixed-sex samples (k = 19), of which one reported pooled results without sex-specific analyses (k = 5). Regarding training status, only two studies (k = 10) involved trained participants, whereas five (k = 43) examined untrained individuals. In terms of habitual caffeine consumption, four studies recruited participants with moderate intake (k = 36), two with low intake (k = 13), and one study mentioned caffeine use without specifying the amount (k = 4).
In the seven cognitive studies, rinse duration was either 10 or 20 seconds. All used 25 mL of mouth rinse, and the number of rinses ranged from one to eight depending on the experimental protocol.
In terms of cognitive task type, seven studies involved speed-based performance (k = 32), and five examined accuracy-based performance (k = 21). With respect to pre-exercise nutritional status, three studies tested participants in the fed state (k = 35), two in the fasted state (k = 14), and one did not specify dietary status (k = 4).
For more details, please refer to Table 2.
Primary analysis
3.3.
Our meta-analysis showed that Caff-MR may be associated with trivial improvements in general exercise performance outcomes (k = 114, g = 0.12, 95% CI [0.04, 0.21], I² = 21% [low], PIE [−0.19, 0.43], p = 0.006, Moderate GRADE) (Figure 2; see also Appendix S3 for a traditional forest plot with study labels and weights).
Primary pooled effect sizes for caffeine mouth rinse on overall exercise and cognitive performance. Notes: K, the total number of effects included in the pooled effect size; Hedges' g, the effect size indicator used in the pooled estimate; 95% CI, 95% confidence interval; 95% PIE, prediction interval; P-value, statistically significant P values for pooled results; I², quantitative indicator of heterogeneity; Power, statistical power for pooled effect size; Blue circles, GRADE, Grading of Recommendations Assessment, Development and Evaluation, a system for evaluating the quality of evidence and strength of recommendations.
For general cognitive performance, the primary meta-analysis did not show consistent benefits of Caff-MR (k = 53, g = 0.23, 95% CI [−0.02, 0.49], I² = 69% [substantial], PIE [−0.51, 0.98], p = 0.07, Very low GRADE) (Figure 2; see also Appendix S3 for a traditional forest plot with study labels and weights).
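As a rough consistency check on the pooled estimates above, the standard error and two-sided p value can be recovered from a reported 95% confidence interval. The sketch below assumes a normal (z) approximation with a critical value of 1.96; the three-level models may instead have relied on a t distribution, so small discrepancies from the reported p values are expected. Function names are illustrative and are not taken from the original analysis.

```python
import math


def se_from_ci(lower, upper, crit=1.96):
    """Approximate standard error from a symmetric 95% confidence interval."""
    return (upper - lower) / (2.0 * crit)


def two_sided_p(estimate, se):
    """Two-sided p value under a normal approximation (assumption)."""
    z = abs(estimate) / se
    return math.erfc(z / math.sqrt(2.0))


# Reported pooled estimates (Hedges' g with 95% CI) from the primary analysis.
exercise_se = se_from_ci(0.04, 0.21)
exercise_p = two_sided_p(0.12, exercise_se)

cognitive_se = se_from_ci(-0.02, 0.49)
cognitive_p = two_sided_p(0.23, cognitive_se)

print(f"exercise:  SE ~ {exercise_se:.3f}, p ~ {exercise_p:.3f}")
print(f"cognitive: SE ~ {cognitive_se:.3f}, p ~ {cognitive_p:.3f}")
```

Under these assumptions the exercise estimate yields a p value close to the reported 0.006, and the cognitive estimate falls just above the 0.05 threshold, in line with the reported non-significant result.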
Variance decomposition from the three-level models indicated that sampling error (level 1) accounted for 73% of the total variance in exercise performance and 25% in cognitive performance (Figure 2). Following the recommendation of Hunter and Schmidt (1990) [67], when the proportion of total variance attributable to sampling error is <75%, meaningful between-study heterogeneity is likely present. Accordingly, we proceeded to moderator analyses.
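The decision rule described above can be written out explicitly. The following is a minimal sketch that applies the 75% threshold to the two reported sampling-error shares; the function and variable names are hypothetical and serve only to illustrate the rule.

```python
def heterogeneity_likely(sampling_error_pct, threshold_pct=75.0):
    """Return True when sampling error explains less than the threshold
    share of total variance, i.e. when moderator analysis is warranted."""
    return sampling_error_pct < threshold_pct


# Level-1 (sampling error) shares of total variance reported in the text.
sampling_error_share = {"exercise": 73.0, "cognitive": 25.0}

decisions = {
    domain: heterogeneity_likely(share)
    for domain, share in sampling_error_share.items()
}

# Share of variance left for levels 2 and 3 (within- and between-study).
remaining_share = {
    domain: 100.0 - share for domain, share in sampling_error_share.items()
}

print(decisions)
print(remaining_share)
```

Both domains fall below the threshold, although the exercise share (73%) sits only marginally under it, whereas the cognitive share (25%) leaves most of the variance unexplained by sampling error.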
Moderator analysis
3.4.
Moderator analysis of exercise performance
3.4.1.
Moderator analyses indicated that in male participants, Caff-MR may be associated with trivial improvements in exercise performance (k = 91, g = 0.14, 95% CI [0.04, 0.23], I² = 28% [moderate], PIE [−0.19, 0.47], p = 0.01; GRADE: Moderate), whereas no consistent enhancement was found in the female subgroup (k = 8, g = 0.11, 95% CI [−0.15, 0.37], p = 0.42; GRADE: Very Low). The between-group difference by sex was not statistically significant (p = 0.76, I² = 21%).
Similarly, Caff-MR produced a trivial improvement in untrained participants (k = 57; g = 0.15; 95% CI [0.01, 0.28]; I² = 31% [moderate]; PIE [−0.21, 0.51]; p = 0.04; GRADE: High) but showed no consistent ergogenic effect in trained participants (k = 8, g = 0.11, 95% CI [−0.15, 0.37], p = 0.42; GRADE: Very Low). The difference between subgroups was not significant (p = 0.62; overall I² = 21%). Across strata of habitual caffeine intake (high, medium, and low), Caff-MR did not yield a consistent ergogenic effect. By contrast, participants with unclear intake exhibited a clear and consistent performance benefit (k = 50; g = 0.23; 95% CI [0.13, 0.32]; I² = 0% [low]; 95% PIE [0.01, 0.44]; p < 0.01; GRADE: Very low), with the effect size significantly greater than that observed in the low- and medium-intake strata (p < 0.05).
Caff-MR was associated with a small improvement in exercise performance under fed conditions (k = 44; g = 0.22; 95% CI [0.12, 0.32]; I² = 48% [moderate]; 95% PIE [−0.01, 0.44]; p < 0.01; GRADE: Low). In contrast, under fasting conditions the effect was near zero and imprecise (k = 50; g = 0.01; 95% CI [−0.10, 0.12]; p = 0.84; GRADE: Low). The fed-state effect was significantly greater than the fasting-state effect (p < 0.05).
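To illustrate how a between-subgroup contrast of this kind can be approximated from published summary statistics, the sketch below compares the fed and fasted estimates with a simple z test. It assumes that the two subgroup estimates are independent and normally distributed, which is a simplification: the original three-level moderator model accounts for dependence among effects from the same study, so this is not the authors' test. All names are illustrative.

```python
import math


def se_from_ci(lower, upper, crit=1.96):
    """Approximate standard error from a symmetric 95% confidence interval."""
    return (upper - lower) / (2.0 * crit)


def subgroup_difference(g_a, se_a, g_b, se_b):
    """z test for the difference between two independent estimates."""
    diff = g_a - g_b
    se_diff = math.sqrt(se_a ** 2 + se_b ** 2)
    z = diff / se_diff
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided, normal approximation
    return diff, se_diff, z, p


# Reported subgroup estimates: fed (g = 0.22) and fasted (g = 0.01).
fed_se = se_from_ci(0.12, 0.32)
fasted_se = se_from_ci(-0.10, 0.12)

diff, se_diff, z, p = subgroup_difference(0.22, fed_se, 0.01, fasted_se)
print(f"difference = {diff:.2f}, SE = {se_diff:.3f}, z = {z:.2f}, p = {p:.4f}")
```

Under these assumptions the difference of about 0.21 is significant at the 0.05 level, which agrees in direction with the reported fed-versus-fasted contrast.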
Across exercise types, Caff-MR demonstrated a consistent ergogenic effect on aerobic endurance performance (k = 25; g = 0.21; 95% CI [0.01, 0.35]; I² = 41% [moderate]; 95% PIE [−0.12, 0.54]; p < 0.01; GRADE: Moderate). By contrast, effects on anaerobic performance, muscular endurance, and strength/power were inconsistent and not statistically significant (all p > 0.05), and no between-group differences were detected (interaction p > 0.05).
Interestingly, mouth rinsing for 10, 15, or 30 seconds did not yield a consistent ergogenic effect, whereas the shortest duration of 5 seconds did (k = 41; g = 0.23; 95% CI [0.09, 0.36]; I² = 39% [moderate]; 95% PIE [−0.08, 0.53]; p < 0.01; GRADE: Low). Furthermore, the 5-second effect size was significantly greater than that observed with 10 seconds (p < 0.05).
For more information on exercise performance subgroup results, please refer to Figure 3.
Moderator analysis for exercise primary results. Notes: K, the total number of effects included in the pooled effect size; Hedges' g, the effect size indicator used in the pooled estimate; 95% CI, 95% confidence interval; 95% PIE, prediction interval; P-value, statistically significant P values for pooled results; I², quantitative indicator of heterogeneity; Power, statistical power for pooled effect size; GRADE, Grading of Recommendations Assessment, Development and Evaluation, a system for evaluating the quality of evidence and strength of recommendations; *, represents significant differences between groups; †, represents significant difference between the two categories within the group.
After harmonisation, five studies [20,54,64,68,69] were excluded from the dose–response analysis due to incomplete or inconsistent dosing information (for example, doses reported only in mg·kg⁻¹, missing concentration data, or inconsistent rinse counts relative to the target exposure window). Given the wide range of total exposures, we performed a meta-regression on log₁₀-transformed total caffeine exposure and found no linear association with performance (β = −0.02, p = 0.79) (Appendix S4).
However, a quadratic model indicated a significant U-shaped relationship (β₁ = −1.34, p = 0.02; β₂ = 0.24, p = 0.03). Based on this model, the earliest point at which a stable ergogenic effect emerged was at log₁₀(dose) ≈ 1.51, corresponding to the lowest included total exposure of 32 mg, after which the effect size diminished with increasing dose and reached a nadir at the turning point log₁₀(dose) = 2.81 (≈646 mg). The dose window supporting a stable ergogenic effect was approximately 32–133 mg (Appendix S5). A cubic specification provided a poorer fit and yielded no significant terms (Appendix S6).
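The turning point of a quadratic meta-regression follows directly from its coefficients. The sketch below assumes the model has the form g = β₀ + β₁·x + β₂·x², with x = log₁₀(total exposure in mg), and uses the rounded coefficients reported above; because those coefficients are rounded to two decimals, the result is only expected to approximate the published turning point. Names are illustrative.

```python
import math

# Rounded coefficients reported for the quadratic dose-response model.
beta_1 = -1.34
beta_2 = 0.24


def quadratic_turning_point(b1, b2):
    """x-coordinate of the vertex of b0 + b1*x + b2*x**2."""
    return -b1 / (2.0 * b2)


def log10_to_mg(log10_dose):
    """Back-transform a log10 dose to milligrams."""
    return 10.0 ** log10_dose


turning_point_log10 = quadratic_turning_point(beta_1, beta_2)
turning_point_mg = log10_to_mg(turning_point_log10)

# Reported turning point and dose window, expressed on both scales.
reported_turning_point_mg = log10_to_mg(2.81)
window_log10 = (math.log10(32.0), math.log10(133.0))

print(f"vertex from rounded coefficients: {turning_point_log10:.2f} "
      f"(~{turning_point_mg:.0f} mg)")
print(f"reported turning point: 2.81 (~{reported_turning_point_mg:.0f} mg)")
print(f"32-133 mg on the log10 scale: {window_log10[0]:.2f}-{window_log10[1]:.2f}")
```

With the rounded coefficients the vertex lies near log₁₀(dose) = 2.79, close to the reported 2.81; the small gap is plausibly due to rounding. Because β₂ is positive, the vertex is a minimum, consistent with the description of a nadir.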
Exploratory analyses further examined rinse frequency as an independent moderator. When modelled categorically, moderate rinse frequencies (2–9 rinses) were associated with a small improvement in exercise performance (k = 51, g = 0.17, 95% CI [0.06, 0.28], p < 0.01), and this pattern persisted after exclusion of outliers (k = 39, g = 0.25, 95% CI [0.13, 0.36], p < 0.01). In contrast, single-rinse and high-frequency protocols did not show statistically significant benefits (p > 0.05). When rinse frequency was treated as a continuous variable, no significant linear association with performance was observed. However, quadratic meta-regression identified a statistically significant inverted U-shaped association (β₁ = 0.04, p = 0.01; β₂ = −0.01, p = 0.01), suggesting that ergogenic effects may be more detectable at intermediate rinse frequencies, with attenuated effects at lower and higher frequencies. For more details, please refer to Appendix S7.
Finally, in the exploratory linear model relating rinse time × total exposure, rinse time was treated as a categorical factor (5 s vs 10 s) and total caffeine exposure as a continuous predictor, with dose centred at 32 mg so that the intercept reflects the minimum exposure. There was no significant linear association between dose and performance within either the 5-second or 10-second conditions, and no evidence that the dose–response differed between these conditions (Appendix S8). The 30-second condition was excluded due to unclear total exposure, and the 15-second condition could not be examined in the interaction model because it appeared in only one study with insufficient variation in exposure [16].
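The predictor coding described in this paragraph can be sketched as a design row for each effect size. The example assumes that the 5-second condition is the reference category and that exposure enters on the raw milligram scale after centring at 32 mg; neither detail is stated explicitly in the text, and the function name is hypothetical.

```python
def design_row(rinse_seconds, total_exposure_mg, centre_mg=32.0):
    """Build [intercept, is_10s, centred_dose, is_10s * centred_dose].

    Assumptions: 5 s is the reference level and dose is centred on the
    raw mg scale, so the intercept is the 5-s effect at 32 mg.
    """
    if rinse_seconds not in (5, 10):
        raise ValueError("only the 5-s and 10-s conditions enter this model")
    is_10s = 1.0 if rinse_seconds == 10 else 0.0
    centred_dose = total_exposure_mg - centre_mg
    return [1.0, is_10s, centred_dose, is_10s * centred_dose]


def predict(coefficients, row):
    """Linear predictor: sum of coefficient * predictor."""
    return sum(b * x for b, x in zip(coefficients, row))


# At the minimum exposure with a 5-s rinse, only the intercept is active.
reference_row = design_row(5, 32.0)
# A 10-s rinse at 132 mg activates the main effects and the interaction.
contrast_row = design_row(10, 132.0)

print(reference_row)
print(contrast_row)
```

In this coding, the coefficient on the centred dose is the slope within the 5-second condition, and the interaction coefficient is the difference in slope between the 10-second and 5-second conditions, which is the quantity tested when asking whether the dose–response differs by rinse duration.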
Moderator analysis of cognitive performance
3.4.2.
Moderator analyses indicated that Caff-MR did not consistently enhance performance on either speed-based cognitive tasks (k = 32; g = 0.26; 95% CI [−0.01, 0.51]; p = 0.051; GRADE: Very low) or accuracy-based tasks (k = 21; g = 0.19; 95% CI [−0.08, 0.47]; p = 0.16; GRADE: Low). No significant between-task differences were detected (p = 0.43; I² = 67%). For more information on cognitive performance subgroup results, please refer to Appendix S9.
Risk of bias and quality of methods
3.5.
Across exercise outcomes, no study was rated overall “low risk”. Several studies were judged as “high risk” [8,15,70], with the remainder classified as having “some concerns.” At the domain level, risk of bias was generally low for missing outcome data and outcome measurement, whereas concerns were more common for aspects related to randomisation, reporting practices, and deviations from intended interventions. Assessments of bias arising from period and carryover effects were mixed, with many studies rated as having some concerns and the remainder as low risk. Overall, the exercise evidence base reflects a predominantly moderate risk-of-bias profile, with high-risk judgments primarily attributable to insufficient reporting of random sequence generation and allocation concealment, unclear or absent pre-registration and specification of primary outcomes, and limited detail to verify adherence to the intended mouth-rinse intervention (Appendix S10).
For cognitive performance, similarly, no study achieved an overall “low risk” rating. One study was assessed as “high risk” [21], with the remainder judged as having “some concerns.” Across domains, most studies showed low risk for missing outcome data, while concerns were more frequent for reporting practices, randomisation-related procedures, and deviations from intended interventions. Bias arising from period and carryover effects was generally low, with some concerns noted in a minority of studies. Taken together, the cognitive evidence indicates a predominantly moderate risk-of-bias landscape with isolated high-risk assessments (Appendix S10).
To improve methodological quality, future studies should adopt transparent and well-documented randomisation procedures, pre-register study protocols with clearly defined primary outcomes, and report intervention fidelity in detail, particularly with respect to mouth-rinse volume, duration, and washout control. Adherence to established reporting guidelines for randomised crossover trials may further reduce risk-of-bias concerns and strengthen internal validity.
Funnel plots, together with Egger’s regression, indicated no significant publication bias for exercise performance (p = 0.10), whereas cognitive performance showed significant asymmetry (p < 0.01) (Appendix S11). Median statistical power was low in both domains (7.3% for exercise; 26.4% for cognition), and the R-index suggested poor replicability (3.3% and 26.3%, respectively). These diagnostic results suggest the need for cautious interpretation of pooled effects (Appendix S12).
In the moderator analyses, Egger’s regression was applied to all subgroups with at least 10 studies (k ≥ 10). For exercise outcomes, significant funnel-plot asymmetry was detected for subgroups with unclear habitual caffeine intake, for both fed and fasted pre-test nutritional states, and for the 5-second mouth-rinsing duration (all p < 0.05), whereas no evidence of publication bias was observed in the remaining subgroups. For cognitive outcomes, the speed-based performance subgroup showed evidence of asymmetry (p < 0.05). Further details are provided in Appendix S11.
For exercise outcomes, the mean modified PEDro score was 7.3, consistent with good methodological quality. For cognitive outcomes, the mean score was 6.6, indicating fair methodological quality. For more information, please refer to Appendix S13. In addition, the certainty of evidence for each outcome was evaluated using the GRADE approach, with details provided in Appendix S14.
Sensitivity analysis
3.6.
Sensitivity analysis for primary effect
3.6.1.
For exercise performance, results were generally robust across sensitivity analyses. In the between-study (level-3) leave-one-out analysis, exclusion of the Taheri Karami et al. (2023) study [17] yielded a modest shift in the pooled estimate that was borderline non-significant (g = 0.11, p = 0.052) and eliminated detectable heterogeneity (I² = 0%). All other sensitivity checks did not materially affect the direction or magnitude of the pooled effect (Appendix S15 and S16).
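The logic of a study-level leave-one-out analysis can be illustrated with a simplified example. The sketch below uses an inverse-variance fixed-effect pooled estimate rather than the three-level model applied in this review, and the effect sizes and standard errors are hypothetical values chosen only for demonstration; they are not data from the included studies.

```python
def pooled_estimate(effects):
    """Inverse-variance weighted mean of (g, se) pairs (fixed-effect)."""
    weights = [1.0 / se ** 2 for _, se in effects]
    weighted_sum = sum(w * g for w, (g, _) in zip(weights, effects))
    return weighted_sum / sum(weights)


def leave_one_study_out(studies):
    """Re-pool after omitting all effects from one study at a time."""
    results = {}
    for omitted in studies:
        kept = [
            effect
            for name, effects in studies.items()
            if name != omitted
            for effect in effects
        ]
        results[omitted] = pooled_estimate(kept)
    return results


# Hypothetical data: study label -> list of (Hedges' g, standard error).
hypothetical_studies = {
    "Study A": [(0.10, 0.20), (0.15, 0.25)],
    "Study B": [(0.60, 0.15)],
    "Study C": [(0.05, 0.18)],
}

all_effects = [e for effects in hypothetical_studies.values() for e in effects]
overall = pooled_estimate(all_effects)
loo = leave_one_study_out(hypothetical_studies)

print(f"all studies: g = {overall:.3f}")
for name, estimate in loo.items():
    print(f"without {name}: g = {estimate:.3f}")
```

In this toy dataset, omitting the study with the largest and most precise effect lowers the pooled estimate markedly, which is the kind of influence a level-3 leave-one-out analysis is designed to expose.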
For cognitive performance, assuming r = 0.8 and after excluding statistical outliers, Caff-MR showed a significant ergogenic effect (p < 0.05) (Appendix S15). In the level-3 leave-one-out analysis, exclusion of the Virdinli et al. study [66] reduced heterogeneity from substantial to 0% (Appendix S16).
Sensitivity analysis for moderator effect
3.6.2.
Within the three-level framework, outlier diagnostics substantially altered several moderator findings. After removing influential cases, several exercise-based subgroups were excluded due to insufficient data: the female subgroup (sex), the medium- and high-intake strata (habitual caffeine intake), and the 15-s rinsing-duration stratum. Additionally, three subgroups that were non-significant in the primary analysis became statistically significant: the trained subgroup (g = 0.13; k = 51; p = 0.01), the unspecified feeding-state subgroup (g = 0.12; k = 19; p = 0.02), and the strength/power subgroup (g = 0.13; k = 29; p = 0.01) (Appendix S17). The linear dose–response relationship remained non-significant (Appendix S4), while the previously observed quadratic, non-linear dose–response was no longer evident (Appendix S6). By contrast, the dose × time analysis indicated that a stable negative linear association emerged between total exposure and performance under the 5-s condition (β = −0.18; k = 63; p = 0.02) (Appendix S8).
For cognitive outcomes, after excluding outliers, Caff-MR showed a robust ergogenic effect on speed-based tasks (Appendix S17).
Discussion
In contrast to previous meta-analytic conclusions of a nonsignificant overall effect [12], the present synthesis demonstrates that Caff-MR can provide small but measurable benefits for exercise performance, with more context-dependent effects on cognitive outcomes. These results indicate that the effects of Caff-MR are not uniform, but vary according to how central perception and motivation interact with task demands and dosing features. The following sections therefore examine how participant characteristics, exercise and cognitive task type, nutritional status, and rinsing strategy influence the detectability and practical relevance of Caff-MR effects.
Exercise outcomes
4.1.
Participant characteristics
4.1.1.
Across sex, our primary models suggested a trivial, directionally positive effect in men, whereas women showed no consistent benefit. The sex-by-subgroup contrast was not significant. The field lacks female-specific evidence. No study enrolled an exclusively female cohort, and the available estimates for women come from subgroup reports within mixed-sex samples with very low statistical power [54]. Furthermore, these reports did not describe menstrual cycle control or hormonal contraception use, both of which may influence perceptual and performance endpoints [3,71]. Mechanistically, Caff-MR is expected to act through oropharyngeal chemosensation and rapid engagement of reward and motor networks such as the insula, orbitofrontal cortex, and striatum, processes that are unlikely to be sex-limited [6,72]. Taken together, the absence of a consistent effect in women most likely reflects limited statistical power and design limitations rather than a true lack of responsiveness.
Untrained participants showed more consistent trivial gains, whereas effects in trained participants were of similar magnitude but less stable. After outlier removal, the trained subgroup effect became statistically robust. Two factors likely account for the initial instability. Biologically, trained athletes begin closer to performance ceilings and display lower within-person variability [3], which may limit the headroom for the motivation-related central effects hypothesised for Caff-MR to translate into measurable improvements. Methodologically, studies in this area are generally small, and those that recruit trained cohorts are smaller still [5,58,60]. Small samples increase the leverage of individual studies on pooled estimates and make signals appear unstable even in the absence of publication bias [26]. Beyond mouth rinse, findings from the broader caffeine literature also suggest that training status seldom acts as a reliable moderator for resistance-type outcomes [73–75]. This pattern supports a cautious interpretation that the apparent trained–untrained differences in our dataset are more sensitive to sampling and design features than to large, systematic physiological moderation by training status.
Regarding habitual caffeine intake, effects diverged by stratum. Both moderate and high consumers showed slightly negative estimates, but the evidence for these strata is sparse. The only dataset classified as high intake came from a subgroup report with 6 participants [5], so any inference is unstable. Interestingly, studies with unclear intake showed a clear positive effect with a prediction interval that did not include zero. Because, on current evidence, a mouth rinse does not depend on systemic absorption, classic pharmacological tolerance cannot fully explain this pattern. Instead, two explanations are more plausible. First, habitual use and expectancy may blunt central drive even without systemic exposure. Parts of the ingestion literature report attenuated acute benefits in heavy users, consistent with habituation of adenosinergic and motivational pathways [76,77]. Second, the “unclear” category likely reflects reporting and selection artifacts. Poor dietary reporting often co-occurs with other methodological limitations, and funnel-plot asymmetry was observed in related subgroups, both of which may inflate apparent effects [3,12]. Notably, for low-intake participants, the pooled effect was positive but imprecise and clustered around zero. While this is currently difficult to explain at the physiological level, it should not be interpreted as evidence of no effect, especially given that most studies only reported the mean of habitual intake within the group, a practice that can lead to significant misclassification errors [3,76].
Taken together, the apparent moderating patterns across sex, training status, and habitual caffeine use should be interpreted cautiously. Given that the overall effects of Caff-MR are small and subgroup estimates are inconsistent, differences in detectability rather than true biological moderation are a more parsimonious explanation. In this context, practical factors such as baseline performance level, expectancy-related central drive, and study design characteristics may disproportionately influence whether small effects are observed. These patterns therefore warrant confirmation in targeted, adequately powered studies.
Pre-exercise nutritional status
4.1.2.
The popularity of mouth rinse is largely due to its ability to bypass the gastrointestinal tract, thereby avoiding the delayed absorption and gastrointestinal discomfort that may accompany ingestion, while activating the central nervous system [3,5]. In practice, however, most athletes are not truly fasted before performance; they typically arrive fed to maximise substrate availability [43]. Distinguishing fed from fasted testing is therefore crucial for interpreting real-world relevance.
Consistent with this context, our subgroup analysis showed that fed-state testing was associated with a small improvement, whereas the fasted-state effect was near zero and imprecise, and the fed effect was significantly greater than the fasted effect.
A plausible mechanism is that, in the fed state, brief oropharyngeal stimulation may enhance central drive via insular, orbitofrontal, and striatal pathways at a time when metabolic substrates are readily available. Under these conditions, sufficient carbohydrate availability and habitual feeding patterns may allow small centrally mediated cues to translate into more stable regulation of pacing and perceived effort during sustained exercise [6,72]. In contrast, under fasting conditions, substrate availability may be reduced and perceived exertion tends to rise, particularly in individuals who are not accustomed to fasted exercise. These factors can increase interoceptive strain and variability in effort regulation, thereby masking subtle centrally mediated benefits of Caff-MR [43,78].
After outlier exclusion, the unspecified feeding-state subgroup showed a stable ergogenic signal. This likely reflects incomplete reporting in studies that instructed participants to maintain their habitual diet. Many such studies probably tested participants within 4 hours of a meal but without precise documentation of meal timing. Consequently, participants’ physiological state may have been closer to the fed condition rather than true fasting, which could increase the likelihood of detecting performance benefits.
However, the role of nutritional status remains unclear. One recent randomised trial reported similar improvements in a running task irrespective of feeding status [16], whereas our pooled analysis showed that fed testing yielded larger effects than fasted testing on average. These divergent findings suggest that nutritional status may interact with task characteristics and protocol details to influence effect detectability. Therefore, targeted studies are needed to clarify these relationships.
Exercise type
4.1.3.
Our analysis indicates that Caff-MR is most reliable for aerobic endurance, with a small but consistent benefit, whereas effects for anaerobic performance, muscular endurance, and strength/power were inconsistent overall. This divergent pattern can be explained by the differing role of central versus peripheral factors across exercise modalities.
During prolonged aerobic tasks (typically > 3 min), performance is largely governed by pacing strategies, attentional focus, and effort perception, which are amenable to central modulation [79–81]. Caff-MR operates through brief oropharyngeal stimulation of brain networks regulating motivation and perceived exertion [72], with neurophysiological evidence showing rapid activation of insular and orbitofrontal cortices [6,78]. Importantly, most aerobic endurance protocols in this review applied repeated mouth-rinse exposures during exercise, with only three studies relying on a single rinse [11,55,58]. Under such conditions, transient centrally mediated signals may be periodically refreshed, providing sufficient opportunity to influence pacing decisions in real time. Over the course of prolonged exercise, even small reductions in perceived effort may therefore accumulate into measurable performance gains.
Conversely, maximal strength and power efforts (around 5–10 s) depend almost exclusively on immediate ATP-phosphocreatine availability and explosive motor unit recruitment [82], leaving minimal opportunity for cognitive-perceptual modulation to enhance peak output [1,83]. Notably, after outlier exclusion, we detected a stable trivial effect (g = 0.11) in strength/power outcomes, suggesting a small yet reliable central contribution (e.g. enhanced neural drive or reduced pre-motor inhibition) that becomes apparent when methodological noise is controlled.
Anaerobic performance (10–60 s) represents a transitional zone where glycolytic capacity and effort regulation both contribute [83]. In brief single-bout efforts (<30 s), peripheral metabolic constraints (pH decline, phosphocreatine depletion) likely dominate, limiting Caff-MR's central influence. In repeated-bout protocols, central cues may modestly aid effort maintenance across sets, though our evidence suggests this effect remains small and variable. Similarly, muscular endurance tasks showed equivocal results, as performance is primarily constrained by local muscle fatigue and metabolite accumulation rather than central pacing [84,85].
Collectively, these findings support a task-dependency model wherein Caff-MR efficacy is maximal when performance relies on sustained effort regulation (aerobic endurance) and minimal when outputs are peripherally dominated (strength/power, short anaerobic efforts), extending prior observations with quantitative evidence across modalities [12,14].
Dosing strategy
4.1.4.
Our synthesis indicates that shorter rinse durations (~5 s) are more likely to yield reproducible, trivial-to-small ergogenic effects than 10–30 s, though the limited number of studies at 15 and 30 s warrants cautious interpretation. Increasing total oral exposure does not produce a linear improvement. In primary models, we observed a non-linear pattern in which benefits emerged at relatively low exposure and attenuated as exposure increased. However, this pattern lost robustness after outlier handling. Moreover, within the 5 s stratum, higher total exposure was negatively associated with performance, arguing against a simple “more or longer is better” assumption.
Mechanistically, a short oropharyngeal chemosensory pulse may sufficiently activate bitter and trigeminal receptors to transiently recruit insula–orbitofrontal–striatal networks implicated in motivation and effort regulation, providing rapid centrally mediated facilitation independent of systemic pharmacokinetics [6,72]. By contrast, prolonged or excessive bitter or chemesthetic stimulation risks perceptual adaptation or crossing aversive thresholds, which can blunt motivational gain and offset benefits [43]. In addition, timing the rinse immediately before key pacing or force-production phases may strengthen perception–action coupling, whereas spreading stimulation across longer windows may dilute attentional engagement and reduce the functional impact of the cue [86].
Within this framework, rinse frequency becomes relevant because it determines how often this transient signal is refreshed during exercise. In our exploratory moderator analyses, intermediate rinse frequencies were more consistently associated with detectable ergogenic effects, whereas both single rinse and very high frequency protocols showed less reliable benefits. This interpretation is consistent with the aerobic endurance subgroup, which predominantly used repeated rinses and showed a stable benefit overall, while the three endurance studies that relied on a single rinse did not report clear performance improvements [11,55,58]. At the other extreme, very frequent rinsing may interrupt pacing continuity or attentional focus and introduce task-level interference that counteracts potential central facilitation. Taken together, these findings suggest that the performance impact of Caff-MR is shaped not only by stimulus intensity and duration, but also by how often brief chemosensory pulses are delivered across the exercise bout.
Two included studies that explicitly examined dose response reported greater benefits with higher doses [9,17]. These findings can be interpreted within the same context-dependent framework rather than as evidence for a uniformly positive dose response. Tallis et al. used a 25 mL coffee-based rinse [17], which may have produced a different sensory profile than a pure caffeine solution, so palatability, caffeine preference, or interindividual variability in taste perception or adenosine-related sensitivity could have contributed beyond concentration alone. Karayigit et al. observed benefits only at an extreme concentration of 3% for muscular endurance [9]. When expressed as total oral exposure, this protocol falls on the ascending limb at the upper end of our U-shaped dose–response curve, beyond the turning point (≈646 mg). Although pooled estimates did not show a stable effect in this high exposure region, the Karayigit et al. result suggests that very high exposures may confer benefits in specific contexts [9], potentially shaped by task demands, timing of repeated exposures, sensory habituation, or participant characteristics. Notably, both studies involved multiple rinses, further supporting the view that apparent dose effects may reflect protocol structure as much as total caffeine exposure.
Collectively, these data support a conservative strategy for Caff-MR: favour ~5-s rinses delivering low-to-moderate total oral exposure, avoid escalating dose or duration in pursuit of larger effects, and when repeated exposures are used, prioritise brief pulses aligned with decision-relevant phases rather than prolonged or highly disruptive rinsing schedules.
Cognitive outcomes
4.2.
In the primary analysis, Caff-MR did not yield a stable effect on overall cognition or on either subgroup (speed- or accuracy-based). After removing outliers, the overall cognitive effect became significant, and the speed-based subgroup also reached significance, suggesting that Caff-MR may preferentially facilitate processing speed rather than accuracy.
This pattern is mechanistically plausible. Brief oropharyngeal chemosensory stimulation can rapidly engage insula, orbitofrontal, and anterior cingulate networks, leading to transient increases in cortical arousal and attentional control [6,72]. Such effects are well suited to shortening response latency, whereas accuracy typically depends on more sustained executive control and inhibitory processes that may be less responsive to brief sensory stimulation [87]. Consistent with this interpretation, mouth-rinse studies and the broader caffeine literature more often report faster responses or preserved processing speed under cognitive load or fatigue, while effects on accuracy tend to be smaller and more variable [54,88,89]. Together, the evidence supports context-dependent cognitive benefits with the greatest sensitivity in speed-based tasks, while underscoring the need for standardised, adequately powered designs to confirm durability and boundary conditions of these effects.
Future directions
4.3.
Building on the present synthesis, several priorities should guide future work.
First, dosing strategy requires purpose-built, adequately powered, preregistered trials that orthogonally vary total oral exposure and rinse duration, with standardised reporting of concentration, volume, frequency, and sensory characteristics. Trials should pre-specify correlation handling for crossover designs and include blinding checks to reduce expectancy confounds.
Second, future studies should directly test how exercise type affects Caff-MR outcomes. Trials that systematically vary exercise type alongside dosing parameters are needed to clarify when centrally mediated effects translate into meaningful performance gains, particularly in endurance tasks where pacing and effort perception are critical. Cognitive outcomes should be assessed using separate reaction time and accuracy metrics to better isolate domain-specific sensitivity.
Third, combined mouth-rinse strategies warrant further investigation. Existing evidence suggests that carbohydrate–caffeine [90,91] and menthol–caffeine [92] formulations may produce additive or synergistic effects through complementary sensory pathways. Future studies should evaluate these and other multi-ingredient combinations using tightly controlled designs that standardise sensory profiles, verify taste-matching procedures, and incorporate affective measures to determine whether perceptual and emotional pathways contribute to performance outcomes.
Fourth, pre-exercise nutritional status requires systematic investigation. Although our analyses suggest more stable effects under fed conditions, direct comparisons between fed and fasted testing remain scarce and yield inconsistent findings [16]. Future trials should explicitly control and report pre-exercise nutritional status to determine whether feeding state represents a true physiological moderator and to identify the contextual factors governing these interactions.
Fifth, because Caff-MR bypasses gastrointestinal transit and aims for rapid central engagement, intranasal caffeine merits exploratory investigation as a mechanistically adjacent route. Existing reviews highlight the theoretical potential of caffeinated nasal sprays to stimulate cranial nerve pathways and permit mucosal absorption [93]. However, evidence remains limited: the only two studies to date have demonstrated neither meaningful improvements in exercise or cognitive performance nor detectable increases in systemic caffeine concentrations [94,95]. Moreover, apart from electrophysiological findings, neuroimaging data and direct assessments linking central activation to functional performance outcomes are lacking. Accordingly, early-phase studies should prioritise safety, pharmacokinetics, central nervous system markers, and head-to-head comparisons with mouth rinsing, followed by pragmatic performance trials and combination protocols to test for additive effects.
Sixth, chemical and formulation constraints should be made explicit. Caffeine’s water solubility at room temperature is approximately 20 mg·mL⁻¹, which corresponds to a practical ceiling near 2% (w/v), or roughly 500 mg in a 25 mL rinse [96,97]. One included study reportedly used 3% without full disclosure of preparation methods [9], which introduces potential bias and reproducibility risks. Future reports should detail solvent systems, temperature, pH, and stabilisers, and should verify concentration analytically to ensure methodological transparency and cross-study comparability.
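The solubility ceiling can be checked with simple unit arithmetic. The sketch below is illustrative only: it assumes the approximate room-temperature solubility of 20 mg·mL⁻¹ quoted above, treats percentages as % w/v (g per 100 mL), and uses hypothetical helper names (`percent_w_v`, `mass_for_rinse_mg`) that do not come from any included study.

```python
# Illustrative unit arithmetic; helper names are hypothetical.
SOLUBILITY_MG_PER_ML = 20.0  # assumed approximate value at room temperature (from the text)
RINSE_VOLUME_ML = 25.0       # rinse volume used as the example in the text


def percent_w_v(mg_per_ml):
    """Convert a concentration in mg/mL to % w/v (g per 100 mL)."""
    return mg_per_ml / 10.0


def mass_for_rinse_mg(percent, volume_ml):
    """Caffeine mass (mg) needed for a rinse of the given % w/v and volume."""
    return percent * 10.0 * volume_ml


ceiling_percent = percent_w_v(SOLUBILITY_MG_PER_ML)
ceiling_mass_mg = mass_for_rinse_mg(ceiling_percent, RINSE_VOLUME_ML)
reported_mass_mg = mass_for_rinse_mg(3.0, RINSE_VOLUME_ML)  # the 3% rinse noted in the text
excess_mg = reported_mass_mg - ceiling_mass_mg

print(f"Ceiling: {ceiling_percent:.1f}% w/v, {ceiling_mass_mg:.0f} mg per {RINSE_VOLUME_ML:.0f} mL")
print(f"3% rinse needs {reported_mass_mg:.0f} mg, {excess_mg:.0f} mg above the assumed ceiling")
```

Under these assumptions, the percentage ceiling does not depend on rinse volume; only the dissolved mass does. A 3% rinse would exceed what plain water at room temperature is expected to hold, which is why preparation details such as temperature and solvent system matter for reproducibility.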
Seventh, to date, only one study has measured post–mouth-rinse blood caffeine and reported no meaningful increase [5]. Future trials should incorporate standardised pharmacokinetic sampling (pre-exercise baseline and serial draws within 5–30 min), alongside central readouts (e.g. EEG/ERP or fNIRS), to further confirm that observed effects are centrally mediated rather than systemic and to refine dose–timing recommendations.
Finally, several populations warrant targeted investigation due to specific physiological and practical factors that may modify Caff-MR responses. Future trials should recruit and prospectively stratify women with controlled menstrual-cycle phase or hormonal contraceptive use, as sex hormones may influence central sensitivity, perceptual responses, and variability in performance outcomes. Highly trained athletes represent another priority group because they operate closer to performance ceilings, where small centrally mediated effects may be harder to detect yet practically meaningful. Habitual high-caffeine users also merit focused study, given potential differences in expectancy, tolerance-related neural adaptation, and receptor responsiveness that may modify responses to non-ingestive caffeine strategies. In addition, adolescents and older adults remain largely unexplored populations in whom caffeine metabolism, sensory processing, and central responsiveness may differ from those of young adults. In parallel, genotype should be a prespecified moderator. The only genotyped study to date classified runners by CYP1A2 and found no performance differences with a 1.2% rinse, but the sample was small and predominantly composed of C-allele carriers, limiting inference [11]. Future work should therefore include ADORA2A, CYP1A2, TAS2R, and exploratory COMT variants to test whether genetic susceptibility shapes perceptual salience and central effects independent of systemic exposure.
4.4 Strengths and limitations
This review provides the most comprehensive and methodologically rigorous synthesis of Caff-MR effects to date. We searched six databases, applied duplicate screening and extraction, and used a three-level meta-analytic framework that accounts for multiple outcomes within studies and avoids double counting. We quantified prediction intervals, conducted robust sensitivity analyses including outlier diagnostics and correlation assumptions for crossover designs, and evaluated evidence certainty via GRADE, which improves the transparency and decision utility of our conclusions.
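To illustrate why the assumed within-subject correlation matters for crossover designs, the sketch below applies a standard textbook formula for the standardised mean difference (SMD) of a paired comparison and its sampling variance. It is not presented as the authors' procedure: the function name `paired_smd`, the input values, and the grid of correlations (0.5, 0.7, 0.9) are hypothetical choices made only for illustration.

```python
import math


def paired_smd(mean_diff, sd_a, sd_b, n, r):
    """SMD and sampling variance for a paired (crossover) comparison.

    r is the assumed correlation between conditions, which primary studies
    often do not report and therefore has to be imputed.
    """
    # SD of the within-participant differences implied by r
    sd_diff = math.sqrt(sd_a**2 + sd_b**2 - 2.0 * r * sd_a * sd_b)
    # Rescale to the within-condition SD so d is comparable with parallel designs
    sd_within = sd_diff / math.sqrt(2.0 * (1.0 - r))
    d = mean_diff / sd_within
    var_d = (1.0 / n + d**2 / (2.0 * n)) * 2.0 * (1.0 - r)
    return d, var_d


# Hypothetical trial: 12 participants, 5-unit mean difference, SD of 10 in each condition
for r in (0.5, 0.7, 0.9):
    d, var_d = paired_smd(mean_diff=5.0, sd_a=10.0, sd_b=10.0, n=12, r=r)
    print(f"r = {r:.1f}: d = {d:.2f}, SE = {math.sqrt(var_d):.3f}")
```

With equal SDs in both conditions the point estimate does not change with r, but the standard error shrinks as the assumed correlation rises, so each study's weight in the pooled model depends on that assumption. This is the motivation for repeating the synthesis across a range of plausible correlations.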
Several limitations warrant consideration. First, despite the three-level approach, heterogeneity remained for multiple outcomes, reflecting variation in participants, dosing protocols, sensory properties, timing, and tasks. Second, the female evidence base is sparse, and most trained cohorts were small, which limits precision and generalisability. Third, age was not systematically restricted or analysed in our synthesis, which may obscure age-related differences in sensory perception, expectancy, or central responsiveness. Fourth, dose-reporting was inconsistent across primary studies. Several trials lacked complete concentration or rinse-frequency data, which constrained our dose–response modelling and required exclusions after harmonisation. Fifth, the estimates for 15 s and 30 s rinse durations rely on very few studies, so inferences for these durations are not robust. Sixth, analyses treating rinse frequency as an independent moderator were exploratory and should be interpreted cautiously. The categorical cut points were data driven and may not map cleanly onto practical protocols, and rinse frequency is tightly coupled to task design, duration, and exercise modality, which reduces interpretability as a stand-alone dose parameter. Finally, cognitive outcomes exhibited funnel-plot asymmetry and low statistical power, raising the possibility of small-study effects. These limitations highlight the need for caution in interpreting our findings and underscore the importance of high-quality, standardised research in this area.
4.5 Practical implications
Caff-MR may be considered a situational ergogenic aid rather than a universal performance strategy. Based on the present findings, several practical implications can be drawn for athletes, coaches, and sport scientists:
First, Caff-MR appears most relevant for aerobic endurance tasks, where pacing and effort regulation play a central role. Practitioners may consider its use during prolonged or self-paced endurance exercise, particularly when athletes are tested or compete in a fed state, which better reflects real-world practice.
Second, brief rinsing durations and low-to-moderate total exposure appear sufficient. Exploratory dose mapping suggests that very short rinses of approximately 5 seconds can elicit comparable benefits to longer protocols, while higher total exposure is unlikely to provide additional advantage. From a practical standpoint, this favours simple, time-efficient protocols that minimise disruption to exercise rhythm.
Third, the ergogenic benefit of Caff-MR is modest and context-dependent. Athletes and coaches should view Caff-MR as an optional adjunct rather than a primary ergogenic strategy. It may be particularly relevant as an alternative when conventional caffeine ingestion is impractical or poorly tolerated due to gastrointestinal discomfort or individual sensitivity.
Finally, applications for cognitive enhancement should be approached cautiously. Although some evidence suggests sensitivity of processing speed to Caff-MR, effects on cognitive accuracy and broader executive outcomes remain inconsistent. Sport scientists integrating Caff-MR into cognitive or dual-task training should therefore temper expectations and consider it exploratory rather than established practice.
Taken together, these findings support the selective and context-aware use of Caff-MR within endurance-focused training or competition settings, while highlighting the need for individualised decision-making and further protocol refinement.
Conclusion
Current evidence indicates that Caff-MR provides small but reliable performance benefits, predominantly in aerobic endurance tasks. Effects appear optimised with brief (~5 s) rinses and low-to-moderate total exposure, while cognitive findings are mixed but show greater sensitivity for processing speed after outlier removal. Given heterogeneity, limited female/trained-athlete data, and methodological variability, high-quality, well-powered trials with standardised dosing and detailed reporting are needed to confirm reliability and refine practical protocols.
Perspective
This meta-analysis offers the most comprehensive evaluation to date of Caff-MR as a rapid, ingestion-free ergogenic strategy. Our findings highlight small but meaningful benefits, particularly in aerobic endurance and processing speed, and show that these effects can be achieved with brief (~5 s) rinses and moderate total exposure, without the need for systemic absorption. Building on prior work in Caff-MR [12–14], this study reinforces the relevance of central, oropharyngeal mechanisms and positions Caff-MR as a practical option for athletes seeking performance support when ingestion is impractical or undesirable. At the same time, the results underscore the need for standardised protocols, better reporting, and sex- and training-specific trials to refine dosing, timing, and combined rinse strategies within sport and exercise science.
Supplementary Material
JISSN_258198509_R2_Electronic_Supplementary_Material
References
- 1. Grgic J, Grgic I, Pickering C, et al. Wake up and smell the coffee: caffeine supplementation and exercise performance—an umbrella review of 21 published meta-analyses. Br J Sports Med. 2020;54(11):681–688. doi: 10.1136/bjsports-2018-100278. PMID: 30926628.
- 2. Spriet LL. Exercise and sport performance with low doses of caffeine. Sports Med. 2014;44(2):175–184. doi: 10.1007/s40279-014-0257-8. PMCID: PMC4213371. PMID: 25355191.
- 3. Guest NS, Van Dusseldorp TA, Nelson MT, et al. International society of sports nutrition position stand: caffeine and exercise performance. J Int Soc Sports Nutr. 2021;18(1):1. doi: 10.1186/s12970-020-00383-4. PMCID: PMC7777221. PMID: 33388079.
- 4. Pickering C. Are caffeine’s performance-enhancing effects partially driven by its bitter taste? Med Hypotheses. 2019;131:109301. doi: 10.1016/j.mehy.2019.109301. PMID: 31443771.
- 5. Doering TM, Fell JW, Leveritt MD, et al. The effect of a caffeinated mouth-rinse on endurance cycling time-trial performance. Int J Sport Nutr Exerc Metab. 2014;24(1):90–97. doi: 10.1123/ijsnem.2013-0103. PMID: 23980239.
- 6. De Pauw K, Roelands B, Knaepen K, et al. Effects of caffeine and maltodextrin mouth rinsing on P300, brain imaging, and cognitive performance. J Appl Physiol (1985). 2015;118(6):776–782. doi: 10.1152/japplphysiol.01050.2014. PMID: 25614603.
- 7. Beaven CM, Maulder P, Pooley A, et al. Effects of caffeine and carbohydrate mouth rinses on repeated sprint performance. Appl Physiol Nutr Metab. 2013;38(6):633–637. doi: 10.1139/apnm-2012-0333. PMID: 23724880.
- 8. Bottoms L, Hurst H, Scriven A, et al. The effect of caffeine mouth rinse on self-paced cycling performance. Comparative Exercise Physiology. 2014;10(4):239–245. doi: 10.3920/CEP140015.
