A meta-analytic reliability generalization study of the Bedtime Procrastination Scale
Esra Oyar, Serpil Çelikten-Demirel, Ayşenur Erdemir

TL;DR
This study evaluates the reliability of the Bedtime Procrastination Scale across multiple studies and finds it to be generally reliable, though with some variability influenced by factors like age and sample type.
Contribution
The study provides a meta-analytic reliability generalization of the Bedtime Procrastination Scale across multiple studies and identifies significant moderators of reliability.
Findings
The Bedtime Procrastination Scale has a high Cronbach’s alpha (0.855) and McDonald’s omega (0.867), indicating strong reliability.
Moderators like age, region, and sample group significantly influence Cronbach’s alpha reliability.
Publication and reporting bias were not detected, but unexplained heterogeneity remains.
Abstract
Bedtime procrastination is defined as deliberately delaying sleep without any external conditions preventing sleep. One of the most frequently used scales in this field is the Bedtime Procrastination Scale (BPS). The original form of the scale consists of nine items rated on a 5-point Likert scale. The BPS has been applied in many cultures, both in the language in which it was developed and in adaptations to other languages. This study aims to examine the reliability coefficients obtained from different studies for the BPS using meta-analysis methods and to determine the average effect size for the scale. For this purpose, studies were searched in the Scopus, ProQuest, Web of Science, ScienceDirect, EBSCO, and Google Scholar databases between 2014 and 2025 using the keyword “Bedtime Procrastination Scale,” and analyses were performed on 128 reliability…
| Moderators | Category | Cronbach’s alpha (f) | McDonald’s omega (f) |
|---|---|---|---|
| Categorical moderators | | | |
| Publication type | Article | 118 | 11 |
| | Thesis | 9 | – |
| Scale language | Chinese | 49 | – |
| | English | 47 | – |
| | Other languages | 31 | – |
| Region | Asia | 84 | – |
| | Europe | 25 | – |
| Sample group | Adolescents | 12 | 1 |
| | General population | 33 | 6 |
| | University students | 76 | 4 |
| | Not reported | 7 | – |
| Quantitative moderators | | | |
| | Mean | Standard deviation | |
| Age | 22.61 | 9.85 | |
| BPS score | 3.14 | 0.83 | |

| Study tag | Publication type | Alpha | Omega | Sample size | Language of BPS | Country | Age, mean (SD) | Sample group | Female (%) | BPS, mean (SD) |
|---|---|---|---|---|---|---|---|---|---|---|
| | Article | 0.782 | | 510 | Arabic | Lebanon | 16.15 (3.24) | Adolescent | 74.90 | 3.07 (0.71) |
| | Article | 0.640 | | 495 | English | Saudi Arabia | 20.89 (2.01) | University stu. | 61.84 | 3.12 (0.76) |
| | Article | 0.910 | | 999 | Chinese | China | 21.16 (1.6) | General pop. | 74.87 | 3.0 (0.8) |
| | Article | 0.870 | | 488 | English | India | | University stu. | 50.20 | 2.92 (0.66) |
| | Article | 0.890 | | 185 | German | Germany | 21.73 (4.18) | University stu. | 86.40 | 3.29 (0.58) |
| | Article | 0.850 | | 137 | German | Germany | 14.41 (0.6) | Adolescent | 51.10 | 2.97 (0.59) |
| | Article | 0.880 | | 74 | English | USA | 25.68 (7.67) | General pop. | 89.20 | 3.07 (0.7) |
| | Article | 0.830 | | 177 | Spanish | Spain | 23.1 (4.94) | University stu. | 75.71 | |
| | Thesis | 0.930 | | 153 | English | USA | 38.23 (13.52) | General pop. | 54.20 | |
| | Thesis | 0.730 | | 57 | Dutch | Holland | 21.53 (3.72) | University stu. | 64.90 | 2.14 (0.75) |
| | Article | 0.887 | | 500 | Chinese | China | 19.4 (0.55) | University stu. | 83.50 | 4.25 (0.5) |
| | Article | 0.867 | | 306 | English | | 20.36 (4.0) | University stu. | 88.88 | 2.8 (0.86) |
| | Article | 0.950 | | 683 | English | Pakistan | 18.83 (0.19) | University stu. | | 3.11 (0.83) |
| | Article | 0.834 | | 1827 | Chinese | China | 19.07 (1.09) | University stu. | 75.50 | |
| | Article | 0.827 | | 576 | Chinese | China | 18.16 (0.73) | University stu. | 44.37 | 3.24 (1.14) |
| | Article | 0.850 | | 106 | Korean | Korea | 22.7 (2.89) | General pop. | 61.30 | 3.35 (0.66) |
| | Article | 0.890 | | 466 | Chinese | China | 20.18 (1.42) | University stu. | 89.78 | 2.76 (0.83) |
| | Article | 0.840 | | 310 | English | Spain | 30.0 (10.1) | General pop. | 46.80 | |
| | Article | 0.795 | | 1,181 | Chinese | China | 18.91 (0.85) | University stu. | 50.72 | 2.85 (0.77) |
| | Article | 0.610 | | 536 | English | Saudi Arabia | 24.27 (5.62) | University stu. | 39.60 | 3.77 (0.23) |
| | Article | 0.875 | | 133 | English | Iran | | University stu. | | |
| | Article | 0.800 | | 913 | Chinese | China | 19.72 (1.24) | General pop. | 54.00 | 3.27 (0.67) |
| | Article | 0.793 | | 2,167 | Chinese | China | 12.99 (1.27) | Adolescent | 44.76 | 3.01 (0.51) |
| | Article | 0.882 | | 821 | English | Belgium | 45.6 (18.01) | General pop. | 59.00 | 2.62 (0.7) |
| | Article | 0.790 | | 490 | Arabic | Iraq | | University stu. | 73.00 | 2.85 (0.75) |
| | Article | 0.910 | | 815 | Chinese | China | 19.53 (1.31) | University stu. | 87.24 | 3.13 (0.71) |
| | Article | 0.738 | | 364 | Chinese | China | 19.48 (0.93) | University stu. | 67.85 | 3.36 (0.74) |
| | Article | 0.730 | | 213 | English | | | Young adults | 80.28 | 2.99 (0.75) |
| | Article | 0.740 | | 419 | Spanish | Peru | 21.68 (3.26) | University stu. | 66.11 | 3.31 (0.86) |
| | Thesis | 0.920 | | 32 | Dutch | Holland | 16.1 (0.81) | Adolescent | 68.75 | 2.83 (0.82) |
| | Article | 0.800 | | 355 | English | China | 19.42 (1.33) | University stu. | 83.10 | 2.75 (0.68) |
| | Article | 0.787 | | 401 | Chinese | China | 19.48 (0.85) | University stu. | 66.08 | 3.18 (0.66) |
| | Article | 0.750 | | 591 | Arabic | Lebanon | 21.13 (4.08) | University stu. | 81.20 | 2.68 (0.73) |
| | Article | 0.870 | | 211 | English | Hungary | 22.25 (3.47) | University stu. | 71.60 | 2.82 (0.68) |
| | Article | 0.860 | 0.860 | 574 | Japanese | Japan | 44.25 (12.84) | General pop. | 50.00 | 3.26 (0.86) |
| | Article | 0.800 | | 1,021 | Chinese | China | 18.97 (0.96) | University stu. | 67.19 | 3.61 (0.75) |
| | Article | 0.859 | 0.834 | 431 | Polish | Poland | 22.2 (3.23) | University stu. | 88.90 | 3.03 (0.68) |
| | Article | 0.862 | 0.839 | 335 | Polish | Poland | 38.7 (13.3) | General pop. | 51.00 | 3.14 (0.76) |
| | Article | 0.870 | | 55 | English | International | 28.7 (6.5) | General pop. | 25.45 | 3.13 (0.87) |
| | Article | 0.860 | | 1,336 | Chinese | China | 19.23 (1.49) | Undergraduate & graduate stu. | 65.76 | 3.27 (0.83) |
| | Article | 0.846 | | 1,048 | Chinese | China | 20.25 (2.29) | University stu. | 44.20 | 2.79 (0.76) |
| | Article | 0.880 | | 98 | English | USA & Sweden | 21.0 (1.7) | University stu. | 62.24 | 3.2 (0.74) |
| | Article | 0.920 | | 217 | German | Germany | 26.9 (7.0) | General pop. | 29.00 | 3.14 (0.43) |
| | Article | 0.720 | | 374 | English | Korea | 23.08 (2.17) | General pop. | 84.50 | 3.24 (0.84) |
| | Article | 0.540 | | 60 | Korean | Korea | 21.33 (2.35) | Young adults | 86.70 | 3.34 (0.83) |
| | Article | 0.910 | | 541 | Chinese | China | | University stu. | 94.30 | 2.86 (0.62) |
| | Article | 0.850 | | 304 | English | Poland | 28.54 (7.97) | University stu. | 71.70 | 2.86 (0.67) |
| | Article | 0.850 | | 175 | English | Poland | 17.66 (0.85) | Adolescent | 46.85 | |
| | Thesis | 0.900 | | 141 | English | Holland | 42.3 (15.6) | General pop. | 69.00 | 2.77 (0.96) |
| | Article | 0.870 | | 221 | English | Singapore | 23.64 (5.72) | General pop. | 63.30 | 3.05 (0.94) |
| | Article | 0.920 | | 177 | English | USA | 39.7 (11.0) | General pop. | 51.40 | 3.11 (0.56) |
| | Article | 0.880 | | 2,431 | English | Holland | 50.7 (18.1) | General pop. | 54.50 | 2.96 (0.6) |
| | Article | 0.930 | | 20 | English | Germany | 12.9 (1.68) | Adolescent | 100.00 | 3.38 (0.71) |
| | Article | 0.865 | | 768 | Turkish | Türkiye | | General pop. | 65.90 | 3.23 (0.85) |
| | Article | 0.824 | 0.817 | 300 | Korean | Korea | 17.0 (0.9) | Adolescent | 50.00 | 2.78 (0.92) |
| | Article | 0.772 | | 522 | Chinese | China | 29.87 (4.85) | In-service clinical nurses | 88.31 | 3.12 (0.21) |
| | Article | 0.920 | | 1,423 | Chinese | China | | University stu. | 80.60 | 2.71 (0.81) |
| | Article | 0.870 | | 763 | Chinese | China | 19.48 (2.06) | University stu. | 64.60 | |
| | Thesis | 0.920 | | 327 | English | New Zealand | 20.93 (6.34) | University stu. | 80.70 | 3.22 (0.86) |
| | Article | 0.853 | | 4,196 | English | China | 29.17 (0.14) | General pop. | 42.28 | 3.48 (0.72) |
| | Article | 0.840 | | 990 | Chinese | China | 23.06 (4.21) | University stu. | 46.06 | 2.89 (0.8) |
| | Article | 0.890 | 0.892 | 252 | English | China | 20.32 (1.47) | University stu. | 100.00 | 2.52 (0.32) |
| | Article | 0.820 | | 1,550 | Chinese | China | 19.3 (0.98) | University stu. | 69.29 | 3.22 (0.77) |
| | Article | 0.850 | | 3,687 | Chinese | China | 16.17 (2.42) | General pop. | 57.23 | 3.17 (0.81) |
| | Article | 0.868 | | 707 | Chinese | China | | University stu. | 70.16 | 3.25 (0.8) |
| | Article | 0.867 | | 267 | Chinese | China | | University stu. | 67.41 | |
| | Article | 0.863 | | 361 | Chinese | China | | University stu. | 73.68 | 3.38 (0.83) |
| | Article | 0.831 | | 552 | Chinese | China | 19.22 (0.64) | University stu. | 62.86 | 3.01 (0.85) |
| | Article | 0.982 | | 3,599 | Chinese | China | 19.12 (1.05) | University stu. | 44.80 | 2.7 (0.8) |
| | Article | 0.855 | | 583 | Chinese | China | 20.02 (1.85) | University stu. | 71.35 | 2.98 (0.36) |
| | Article | 0.900 | 0.920 | 252 | Japanese | Japan | 39.36 (9.27) | General pop. | 55.60 | 2.97 (0.83) |
| | Article | 0.900 | 0.920 | 630 | Japanese | Japan | 37.69 (12.82) | General pop. | 57.20 | |
| | Article | 0.860 | | 271 | Chinese | China | 21.5 (2.8) | University stu. | 58.30 | 2.86 (0.73) |
| | Article | 0.910 | | 234 | English | USA | 37.1 (13.48) | General pop. | 42.31 | 2.7 (0.78) |
| | Article | 0.910 | | 317 | Turkish | Türkiye | 21.78 (3.94) | General pop. | 78.86 | 2.97 (0.77) |
| | Article | 0.900 | | 560 | Portuguese | Portugal | 29.85 (12.83) | General pop. | 74.50 | 3.2 (0.88) |
| | Article | 0.900 | 0.900 | 653 | Portuguese | Portugal | 29.8 (12.45) | General pop. | 74.70 | 2.66 (0.72) |
| | Thesis | 0.720 | | 711 | English | Australia | 15.1 (1.2) | Adolescent | 47.30 | 2.8 (0.41) |
| | Article | 0.850 | | 121 | English | Singapore | 15.9 (1.14) | Adolescent | 54.55 | 3.08 (0.75) |
| | Article | 0.850 | | 119 | English | Singapore | 22.66 (1.67) | University stu. | 53.78 | 2.25 (0.77) |
| | Article | 0.873 | | 769 | English | | 20.89 (1.63) | University stu. | 75.90 | 3.38 (1.24) |
| | Article | 0.678 | | 192 | Indonesian | Indonesia | | Adolescent | 75.00 | 3.22 (0.84) |
| | Article | 0.910 | | 262 | German | Germany | 35.35 (14.05) | General pop. | 66.41 | 3.03 (0.77) |
| | Article | 0.840 | | 433 | Persian | Iran | 22.57 (3.52) | General pop. | 55.70 | 3.23 (0.89) |
| | Article | 0.920 | 0.900 | 241 | English | Pakistan | 29.72 (9.3) | General pop. | 87.5 | 2.79 (0.65) |
| | Article | 0.820 | | 433 | English | Iran | | University stu. | 55.66 | 2.91 (0.8) |
| | Thesis | 0.900 | | 446 | Portuguese | Portugal | 23.7 (5.49) | University stu. | 70.00 | 3.11 (0.85) |
| | Article | 0.880 | | 336 | English | UK | 43.11 (11.41) | General pop. | 56.00 | 3.19 (0.62) |
| | Article | 0.800 | | 453 | Chinese | China | 21.21 (1.59) | University stu. | 44.80 | |
| | Article | 0.770 | | 737 | Chinese | China | 20.05 (1.38) | University stu. | 80.87 | 3.08 (0.61) |
| | Article | 0.780 | | 300 | English | Pakistan | | University stu. | 50.00 | 3.25 (1.69) |
| | Article | 0.780 | | 560 | English | India | 19.8 (1.9) | Undergraduate & graduate stu. | 57.50 | 3.12 (0.82) |
| | Article | 0.890 | | 134 | English | UK | 30.22 (13.5) | General pop. | 77.40 | 3.07 (0.76) |
| | Article | 0.900 | | 646 | English | UK | 30.74 (12.2) | General pop. | 68.90 | |
| | Article | 0.824 | 0.817 | 300 | Korean | Korea | 17.0 (0.9) | University stu. | 50.00 | 2.68 (0.68) |
| | Article | 0.870 | | 220 | English | Australia | 20.34 (2.86) | University stu. | 67.73 | 3.49 (0.76) |
| | Article | 0.830 | | 270 | English | Singapore | 22.39 (5.41) | University stu. | 73.33 | 2.74 (0.76) |
| | Article | 0.890 | | 418 | German | Germany | 23.3 (3.0) | Young adults | 83.60 | 3.62 (0.64) |
| | Article | 0.790 | | 910 | Chinese | China | 20.14 (3.48) | University stu. | 60.00 | 3.02 (0.39) |
| | Article | 0.910 | | 229 | Turkish | Türkiye | 27.82 (10.81) | General pop. | 68.12 | 3.1 (0.74) |
| | Article | 0.610 | | 497 | Turkish | Türkiye | 20.41 (1.83) | University stu. | 72.80 | 3.42 (0.77) |
| | Article | 0.760 | | 553 | Turkish | Türkiye | 20.55 (2.17) | University stu. | 69.60 | 2.8 (0.8) |
| | Article | 0.840 | | 54 | Chinese | China | 19.8 (0.6) | University stu. | 92.60 | |
| | Article | 0.876 | | 935 | Chinese | China | | University stu. | 53.50 | 4.17 (1.43) |
| | Thesis | 0.750 | | 149 | Dutch | Holland | 38.8 (13.3) | General pop. | 53.69 | 3.25 (0.73) |
| | Article | 0.890 | | 855 | Chinese | China | 21.16 (1.83) | University stu. | 46.20 | |
| | Article | 0.804 | | 1,217 | Chinese | China | 20.3 (2.2) | University stu. | 42.65 | 3.11 (0.98) |
| | Article | 0.850 | | 2044 | English | China | | University stu. | 63.11 | 3.71 (0.3) |
| | Article | 0.740 | | 182 | English | Pakistan | 21.98 (2.17) | University stu. | 100.00 | 3.01 (0.41) |
| | Article | 0.852 | | 108 | English | Malaysia | 22.27 (1.89) | Young adults | 57.40 | 3.1 (0.8) |
| | Article | 0.880 | | 1,104 | Chinese | China | 20.2 (1.43) | University stu. | 63.00 | 2.94 (0.93) |
| | Article | 0.880 | | 1,103 | Chinese | China | 20.17 (1.43) | University stu. | 63.00 | 3.02 (0.66) |
| | Article | 0.872 | | 356 | Chinese | China | | University stu. | 50.60 | 2.46 (0.63) |
| | Article | 0.840 | | 464 | English | China | 21.7 (3.1) | University stu. | 65.50 | 3.15 (0.87) |
| | Article | 0.870 | | 2052 | Chinese | China | 20.0 (1.53) | University stu. | 68.40 | 2.91 (0.63) |
| | Article | 0.790 | | 427 | English | China | 19.36 (1.06) | University stu. | 66.00 | 3.13 (1.01) |
| | Thesis | 0.860 | | 105 | English | International | 23.9 (4.07) | University stu. | 73.38 | |
| | Article | 0.805 | | 318 | Chinese | China | 16.92 (0.67) | Adolescent | 64.20 | 3.2 (0.56) |
| | Article | 0.813 | | 698 | Chinese | China | 20.15 (1.77) | University stu. | 33.38 | 3.41 (0.77) |
| | Article | 0.830 | | 3,539 | Chinese | China | 15.6 (2.9) | Elementary to college stu. | 59.80 | 3.36 (0.87) |
| | Article | 0.917 | | 403 | English | Pakistan | 23.42 (4.2) | University stu. | 58.31 | 3.28 (0.91) |
| | Article | 0.874 | | 288 | Chinese | China | 20.89 (2.27) | Undergraduate & graduate stu. | 68.40 | 2.98 (0.75) |
| | Article | 0.860 | | 474 | Chinese | China | | University stu. | 53.00 | 3.23 (0.84) |
| | Article | 0.922 | | 6,543 | Chinese | China | | University stu. | 64.70 | 3.19 (0.84) |
| | Article | 0.795 | | 391 | Chinese | China | 19.48 (0.86) | University stu. | 66.75 | 3.34 (0.81) |
| | Article | 0.760 | | 2,822 | Chinese | China | 19.77 (1.41) | University stu. | 71.40 | 3.55 (0.64) |
| | Article | 0.780 | | 668 | Chinese | China | 20.36 (1.69) | University stu. | 64.97 | 2.92 (0.66) |
| | Article | 0.830 | | 665 | Chinese | China | 13.72 (1.64) | Adolescent | 49.77 | 3.29 (0.87) |

| Moderator | | | | | Estimate ( | 95% CI | |
|---|---|---|---|---|---|---|---|
| Mean age | 110 | 0.182 | 98.10 | 6.87 | 0.016 | 0.005, 0.027 | 0.005* |
| SD age | 109 | 0.175 | 98.04 | 8.75 | 0.030 | 0.012, 0.048 | 0.001* |
| Women (%) | 125 | 0.177 | 98.19 | 0.00 | 0.000 | −0.005, 0.005 | 0.935 |
| Sample size ( | 127 | 0.179 | 98.13 | 2.07 | 0.000 | −0.000, 0.000 | 0.072 |
| Sample size (log | 127 | 0.184 | 98.21 | 0.00 | 0.020 | −0.059, 0.100 | 0.619 |
| Mean BPS score | 113 | 0.194 | 98.39 | 1.27 | −0.200 | −0.470, 0.070 | 0.144 |
| SD BPS score | 113 | 0.197 | 98.42 | 0.00 | 0.197 | −0.231, 0.625 | 0.365 |

| Moderator | Category | | Alpha [CI] | | | df | |
|---|---|---|---|---|---|---|---|
| Region | Asia | 84 | 0.845 [0.830–0.858] | 40.903* | 5.776* | 1 | 0.016 |
| | Europe | 25 | 0.877 [0.855–0.896] | 24.499* | | | |
| Scale language | Chinese | 49 | 0.854 [0.835–0.871] | 30.954* | 0.539 | 2 | 0.764 |
| | English | 47 | 0.859 [0.840–0.876] | 30.329* | | | |
| | Others | 31 | 0.849 [0.823–0.870] | 23.721* | | | |
| Sample group | Adolescent | 12 | 0.825 [0.777–0.863] | 14.088* | 10.742* | 2 | 0.005 |
| | General population | 33 | 0.882 [0.864–0.897] | 29.439* | | | |
| | University students | 76 | 0.849 [0.835–0.863] | 39.931* | | | |
| Publication type | Article | 118 | 0.854 [0.842–0.865] | 47.748* | 0.421 | 1 | 0.516 |
| | Thesis | 9 | 0.868 [0.822–0.902] | 13.445* | | | |

| Moderator | | | | | Estimate ( | 95% CI | |
|---|---|---|---|---|---|---|---|
| Mean age | 11 | 0.112 | 95.09 | 20.00 | 0.020 | −0.005, 0.045 | 0.099 |
| SD age | 11 | 0.108 | 94.92 | 22.75 | 0.039 | −0.007, 0.085 | 0.086 |
| Women (%) | 11 | 0.150 | 96.33 | 0.00 | 0.004 | −0.011, 0.020 | 0.546 |
| Sample size ( | 11 | 0.153 | 96.30 | 0.00 | 0.000 | −0.001, 0.002 | 0.656 |
| Sample size (log | 11 | 0.156 | 96.41 | 0.00 | 0.042 | −0.724, 0.808 | 0.903 |
| Mean BPS score | 10 | 0.109 | 94.57 | 11.88 | −0.612 | −1.587, 0.363 | 0.186 |
| SD BPS score | 10 | 0.123 | 95.35 | 1.01 | −0.738 | −2.385, 0.909 | 0.332 |

| Sample group | | | | | df | |
|---|---|---|---|---|---|---|
| General population | 6 | 0.894 [0.863–0.917] | 17.345* | 5.445* | 1 | 0.020 |
| University students | 4 | 0.828 [0.766–0.874] | 11.097* | | | |

Introduction
Procrastination is a common phenomenon that occurs frequently in daily life and is extensively studied. In the scholarly literature, procrastination is characterized as the voluntary postponement of intended tasks or decisions, even when individuals anticipate that such delays will be detrimental to desired outcomes (Steel, 2007; Sirois and Pychyl, 2013).
Procrastination occurs across multiple life domains, including academic contexts (Schouwenburg, 1995; Klingsieck, 2013; Steel, 2007), occupational settings (Nguyen et al., 2013), everyday household tasks (Ferrari et al., 1995; Milgram et al., 1988), and health-related behaviors (Sirois, 2004; Sirois, 2007), where individuals delay necessary tasks in ways that may undermine performance and wellbeing.
Beyond these domains, procrastination has also been increasingly examined in relation to sleep-related behaviors. Research over the past decade has consistently shown that insufficient or poor-quality sleep is associated with a range of adverse mental and physical health outcomes. Bedtime procrastination, as a self-regulatory failure that prioritizes leisure or technology-related activities over sleep, has been identified as a key behavioral contributor to sleep deprivation, leading to reduced sleep duration and impaired sleep quality (Chung et al., 2020; Kroese et al., 2014; Kühnel et al., 2016). In contemporary societies characterized by fast-paced lifestyles (Åkerstedt and Nilsson, 2003; Basner et al., 2007) and pervasive use of digital technology (Cain and Gradisar, 2010; Exelmans and Bulck, 2017), bedtime procrastination has emerged as a prevalent and consequential behavior with significant implications for mental and physical health (Ford and Kamerow, 1989; Howarth and Miller, 2024; Manocchia et al., 2001; Zhang et al., 2024). In light of the growing recognition of bedtime procrastination as a significant determinant of sleep-related outcomes, it is imperative to employ a reliable and valid instrument to measure this construct. Furthermore, the existing measurement tool must be subjected to rigorous evaluation in terms of its validity and reliability.
Bedtime procrastination scale
Bedtime procrastination is defined as the voluntary delay of going to bed without external circumstances preventing sleep. The Bedtime Procrastination Scale (BPS), originally developed in English (Kroese et al., 2014), is a widely used instrument to assess this construct. The instrument comprises nine items within a unidimensional structure, four of which are reverse-worded. Over the past decade, the BPS has been translated and adapted into several languages, including Arabic (Hammoudi et al., 2021), Chinese (Ma et al., 2021), German (Bernecker and Job, 2019), Spanish (Brando-Garrido et al., 2022), Dutch (Broers, 2014), Korean (An et al., 2019), Japanese (Miyagawa et al., 2024), Polish (Herzog-Krzywoszanska and Krzywoszanski, 2019), Portuguese (Magalhães et al., 2020), Indonesian (Rahayu and Caninsti, 2024), Persian (Rasouli et al., 2025), and Turkish (Dinç et al., 2016). With these various versions, the BPS has been administered across a wide range of cultural contexts, including Western (e.g., the USA and Germany), East Asian (e.g., China and Japan), South Asian (e.g., India and Pakistan), Middle Eastern (e.g., Iran and Saudi Arabia), and Southeast Asian societies (e.g., Singapore and Indonesia). In addition, studies employing the BPS have targeted populations ranging from early adolescence to adulthood, with ages of approximately 13–56 years (Exelmans and Bulck, 2021; Kullik et al., 2025). These populations include young adults (Flores et al., 2023; Jeoung et al., 2023), adolescents (Deng et al., 2024; Pu et al., 2022), university students (Hamvai et al., 2023; Xu et al., 2024), and the general population (Deng et al., 2022; Miyagawa et al., 2024).
This extensive use underscores the widespread acceptance and versatility of the BPS as a measure of bedtime procrastination. However, its application across diverse populations and research contexts also raises important questions regarding the consistency and generalizability of its reliability estimates. The reliability of BPS scores has shown substantial inconsistency across studies, with reported internal consistency estimates (e.g., Cronbach’s alpha and McDonald’s omega) ranging from 0.540 to 0.982. Moreover, several studies have relied on reliability coefficients reported in the original validation study rather than estimating reliability from their own data, a practice referred to as reliability induction (Thompson and Vacha-Haase, 2000; Huang et al., 2023; Manasa and Saju Stephen, 2024; Son and Kwag, 2021). This practice reflects the erroneous assumption that reliability is an invariant property of the instrument itself, rather than a characteristic of the scores obtained within a specific sample and research context (Thompson and Vacha-Haase, 2000). Given the diversity of samples and study conditions under which the BPS has been administered, variability in reported reliability estimates is expected, and the absence of consistently reported coefficients further limits conclusions regarding the scale’s psychometric robustness.
The widespread use of the Bedtime Procrastination Scale across multiple languages and diverse target populations, together with the observed variability and instability of its reported reliability coefficients across studies, constitutes the research gap that justifies a reliability generalization meta-analysis (RGMA) of the BPS.
Meta-analytical reliability generalization
Reliability is a fundamental psychometric property of test scores, reflecting the consistency of scores across administrations under comparable conditions, yet it may vary across applications as a function of score variability, sample characteristics, and administration procedures (Crocker and Algina, 1986; Henson and Thompson, 2002). Because reliability coefficients can fluctuate across administrations, systematically identifying the factors that influence, or do not influence, this variability for a given measurement instrument can inform more rigorous and precise reliability practices in subsequent research employing the instrument (Henson and Thompson, 2002; López-Ibáñez et al., 2024; Vacha-Haase et al., 2002). In line with the American Psychological Association’s (2020) Journal Article Reporting Standards, researchers are explicitly encouraged to estimate and report reliability coefficients for the scores analyzed in their own samples, underscoring that reliability is a property of the obtained scores rather than a fixed characteristic of the measurement instrument (Appelbaum et al., 2018). Developed for this purpose, reliability generalization (RG) studies aim to (a) examine the distribution of reliability coefficients reported in the literature, (b) identify possible sources that account for variability in these estimates, and (c) provide pooled reliability estimates for the instrument under investigation (López-Ibáñez et al., 2024; Vacha-Haase, 1998; Vacha-Haase et al., 2002). Through the use of RG, researchers can potentially design future studies in ways that enhance score reliability, increase effect sizes, improve statistical power, and strengthen the likelihood of obtaining significant results (Henson and Thompson, 2002).
To summarize, the widespread use of the Bedtime Procrastination Scale, along with the non-constant nature of its reliability and the variability of reliability coefficients reported across studies, underscores the importance of conducting a reliability generalization meta-analysis for this instrument. Such an approach enables a systematic evaluation of the psychometric robustness of the BPS, ensuring its valid and reliable application across diverse populations and research contexts.
Purpose of the study
The purpose of the present study is to examine the meta-analytic reliability of the BPS while considering various moderator variables, thereby accounting for the heterogeneity observed in reliability coefficients.
By applying this approach to the BPS, the present study aims to (a) quantify the overall reliability of the scale across published research, (b) evaluate the variability in reliability estimates, and (c) explore predictors of heterogeneity in reliability. The continuous moderators are mean age, standard deviation of age, sample size, percentage of female participants, and the mean and standard deviation of the BPS score; the categorical moderators are region, language, sample type, and publication type.
Method
A Reliability Generalization Meta-Analysis (RGMA; Vacha-Haase, 1998) was conducted to estimate average population reliability coefficients for the BPS. The conduct and reporting followed the REGEMA guidelines and checklist proposed by Sánchez-Meca et al. (2021), developed to address the lack of dedicated RG standards. These guidelines aim to enhance clarity, reproducibility, and transparency in RG studies through a structured flow and a 30-item checklist across eight dimensions, which informed study selection, coding, and synthesis.
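To make the idea of an "average reliability coefficient" concrete, the following is a minimal sketch of one common way to pool Cronbach's alpha values. It is an illustration under stated assumptions, not the authors' procedure: the Bonett log transformation and the DerSimonian-Laird random-effects estimator are assumed here because this section does not name the transformation or estimator used. The three alpha and sample-size pairs are taken from the first rows of the study table, and the function names are hypothetical.

```python
import math

N_ITEMS = 9  # the BPS has nine items (per the scale description)

def bonett_transform(alpha, n, n_items=N_ITEMS):
    """Assumed transformation: L = ln(1 - alpha) with its approximate sampling variance."""
    y = math.log(1.0 - alpha)
    v = 2.0 * n_items / ((n_items - 1) * (n - 2))
    return y, v

def pool_random_effects(alphas, ns):
    """DerSimonian-Laird random-effects pooling on the transformed scale (an assumption)."""
    ys, vs = zip(*(bonett_transform(a, n) for a, n in zip(alphas, ns)))
    w = [1.0 / v for v in vs]
    y_fixed = sum(wi * yi for wi, yi in zip(w, ys)) / sum(w)
    # Cochran's Q and the method-of-moments between-study variance
    q = sum(wi * (yi - y_fixed) ** 2 for wi, yi in zip(w, ys))
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (len(ys) - 1)) / c)
    w_re = [1.0 / (v + tau2) for v in vs]
    y_re = sum(wi * yi for wi, yi in zip(w_re, ys)) / sum(w_re)
    se = math.sqrt(1.0 / sum(w_re))
    # Back-transform to the alpha metric; the bounds swap because
    # 1 - exp(x) is a decreasing function of x.
    pooled = 1.0 - math.exp(y_re)
    lower = 1.0 - math.exp(y_re + 1.96 * se)
    upper = 1.0 - math.exp(y_re - 1.96 * se)
    return pooled, (lower, upper), tau2

# Alpha values and sample sizes from the first three rows of the study table
alphas = [0.782, 0.640, 0.910]
ns = [510, 495, 999]
pooled, ci, tau2 = pool_random_effects(alphas, ns)
print(f"pooled alpha = {pooled:.3f}, 95% CI = [{ci[0]:.3f}, {ci[1]:.3f}], tau^2 = {tau2:.3f}")
```

With only three coefficients the result is purely illustrative; the pooled estimates reported in this study are based on 128 coefficients.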
Search strategy
In the first stage of the data collection process, a literature search was conducted in the Scopus, ProQuest, Web of Science, ScienceDirect, EBSCO, and Google Scholar databases using the keyword “Bedtime Procrastination Scale.” These searches were conducted in July 2025, and no year limit was set for the studies. Since the scale addressed in the study was developed in 2014, the studies reviewed cover the period between 2014 and July 2025. Hand-searching was also performed.
Inclusion and exclusion criteria
Studies were included in the meta-analysis if (1) the study was published as an article or thesis; (2) the language of publication was English; (3) the scale used was the Bedtime Procrastination Scale (BPS) developed by Kroese et al. (2014) or an adapted version; and (4) the reliability coefficient and sample size for the BPS were reported.
Studies were excluded if (1) they used a qualitative research design or were bibliometric studies, meta-analyses, or systematic reviews; (2) items of the BPS were removed or additional items were added; (3) the BPS was used with a rating scale other than a 5-point Likert scale; (4) the study was written in a language other than English; (5) the study was published in a format other than an article or thesis; (6) the study used data from a previously included study (same sample); or (7) the study used the BPS but either explicitly reported no reliability coefficients (by report) or omitted them altogether (by omission), and the authors did not respond to requests for reliability information or an anonymized data set.
Figure 1 shows the REGEMA flowchart for BPS, which summarizes the selection process of studies included in the meta-analysis. This process includes both exclusion and inclusion stages.
Figure 1. REGEMA flowchart for BPS.
A total of 650 records were identified through database searching (Scopus = 91, ProQuest = 36, Web of Science = 48, ScienceDirect = 23, EBSCO = 121, Google Scholar = 326) and other sources (hand-searching = 5). After removing 324 duplicates, 326 records remained for screening. Of these, 123 were excluded based on titles and abstracts. The full texts of 202 reports were assessed for eligibility, with one report not retrieved. Among the 201 accessible reports, 76 were excluded for the following reasons: no reliability data (n = 27), nonacceptable publication type (n = 21), non-English (n = 13), not original 9-item 5-point BPS (n = 6), no BPS data (n = 3), duplicate data (n = 3), non-acceptable study design (n = 2), and unclear reporting of reliability (n = 1). Ultimately, 122 studies were included in the review. Since six of these studies were conducted on two different groups, analyses were performed with a total of 128 independent reliability coefficients. Since these subgroups were based on non-overlapping participant samples, each reliability coefficient was treated as an independent unit of analysis, consistent with recommendations for independent subgroups within studies (Borenstein et al., 2009, Chapter 23). In addition, when the same participant group was assessed at multiple time points within a study, these coefficients were not treated as independent. Instead, a single composite reliability coefficient was computed for that study by aggregating across time points, along with a corresponding mean BPS score and standard deviation. This approach follows established recommendations for handling multiple outcomes or time points based on the same participants to avoid violating independence assumptions (Borenstein et al., 2009, Chapter 24).
Data extraction
In reliability generalization research, variability in reported reliability coefficients is commonly examined in relation to study-level methodological and sample characteristics. This variability can be attributed to the methodological characteristics including sample size (Peterson, 1994; Vassar and Bradley, 2011), publication year (Greco et al., 2018), publication type (Vassar et al., 2011; Tümtürk and Sen, 2025), mean of the scale score (Hayat, 2024; Miller et al., 2018; Tümtürk and Sen, 2025), and standard deviation of the scale score (Aguayo-Estremera et al., 2011; Miller et al., 2018), as well as participants’ demographic characteristics such as age (Pretorius and Padmanabhanunni, 2025), gender (Aguayo-Estremera et al., 2011; Vassar et al., 2011), and sample type (Peterson, 1994; Pretorius and Padmanabhanunni, 2025). In addition, contextual variables including language (Grace et al., 2018; Pretorius and Padmanabhanunni, 2025) or geographical region of the study (Aguayo-Estremera et al., 2011; Tümtürk and Sen, 2025) are frequently considered.
Consistent with prior reliability generalization studies that have examined methodological, sample-related, and contextual sources of variability, the present study operationalized these factors through a standardized data extraction procedure. For each study, the study tag (first author and year) and the publication type (e.g., peer-reviewed article and thesis) were recorded. Reported reliability coefficients were extracted, including Cronbach’s alpha and/or McDonald’s omega. Sample-related characteristics were documented, such as sample size, language of the BPS administration, country of data collection, participants’ mean age and standard deviation, sample group (e.g., university students, adolescents, and general population), and the female ratio in the sample. Descriptive statistics for the scale, including the BPS mean and standard deviation, were also extracted. Where necessary, BPS means and standard deviations reported at the total-score level were recalculated and transformed to item-level metrics to ensure consistency across studies. Specifically, some primary studies reported descriptive statistics based on the summed total BPS score, whereas others reported item-level mean scores (i.e., average per item). To ensure comparability across studies, all total-score means and standard deviations were converted to item-level means and standard deviations by dividing the total score statistics by the number of items in the BPS. This harmonization allowed all descriptive statistics to be expressed on a common item-level scale.
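The harmonization of total-score statistics to the item-level metric described above can be illustrated with a minimal sketch. The function name and the example values (a summed-score mean of 27.9 and SD of 6.3) are hypothetical and are not taken from any included study; only the nine-item length of the BPS comes from the text.

```python
N_ITEMS = 9  # the original BPS consists of nine items


def to_item_level(total_mean, total_sd, n_items=N_ITEMS):
    """Convert a summed-score mean and SD to the per-item (average) metric.

    Dividing a variable by a constant divides both its mean and its
    standard deviation by that constant.
    """
    return total_mean / n_items, total_sd / n_items


# Hypothetical study that reported the summed BPS score
item_mean, item_sd = to_item_level(27.9, 6.3)
print(round(item_mean, 2), round(item_sd, 2))
```

After this conversion, the mean lies on the 1 to 5 response scale, so it can be compared directly with studies that reported per-item averages.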
Study selection and coding were conducted using a multi-stage and systematic procedure. All screening and eligibility decisions were managed using the Rayyan software (Ouzzani et al., 2016). In the first stage, title and abstract screening were performed independently and in parallel by two reviewers. All records were evaluated by both reviewers and labelled in Rayyan as include, exclude, or maybe. Records for which the reviewers’ decisions were discordant (i.e., those marked as conflict in Rayyan), as well as records labelled as maybe by at least one reviewer, were discussed in consensus meetings involving all three authors. Final inclusion or exclusion decisions were reached by agreement, in accordance with the predefined eligibility criteria.
Studies that passed the abstract screening stage and were selected for full-text review were then examined in detail by two reviewers. For all studies deemed eligible at this stage, relevant information was coded using a standardized data extraction form. In the final stage, all data extraction sheets prepared for the included studies were independently reviewed by the third author. Any potential errors, omissions, or inconsistencies identified at this stage were re-evaluated in consultation with the other authors, and necessary corrections were made before finalizing the dataset.
This multi-stage, independent, and consensus-based review process was designed to ensure consistency and methodological rigor in the selection and coding of studies included in the meta-analysis.
Data analysis
All analyses were conducted within a reliability generalization meta-analytic framework (Vacha-Haase, 1998; Rodríguez and Maeda, 2006) to estimate pooled internal consistency coefficients for the BPS and to examine sources of variability across studies. Both Cronbach’s alpha (α) and McDonald’s omega (ω) were included when reported. Because reliability coefficients are bounded between 0 and 1 and typically skewed, estimates were transformed prior to analysis using Bonett’s (2002) ABT variance-stabilizing transformation, which has been recommended for internal consistency coefficients such as α and ω (López-Ibáñez et al., 2024). This approach yields effect sizes with approximately normal sampling distributions and known large-sample variances. At this stage, all meta-analytic computations, including pooled reliability estimates, confidence intervals, and heterogeneity statistics, were performed using the transformed coefficients. To facilitate interpretation of the results, all pooled reliability estimates and their corresponding confidence intervals were subsequently back-transformed to the original reliability coefficient metric (Cronbach’s alpha and McDonald’s omega). Accordingly, the pooled reliability estimates and confidence intervals reported in this study are presented on the original scale familiar to readers. This approach enhances the interpretability of results obtained from statistical analyses conducted on the transformed scale in the RG study (Sánchez-Meca et al., 2013).
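As a rough illustration of the transformation step, the sketch below implements a Bonett-type transformation and its inverse. Two points are assumptions rather than details stated in the text: the sign convention −ln(1 − coefficient), chosen because it agrees with the positive pooled values reported in the Results, and the large-sample variance 2J / [(J − 1)(n − 2)] for a J-item scale, which is the approximation commonly attributed to Bonett (2002). The example study (α = 0.85, n = 200) is hypothetical.

```python
import math


def abt(coefficient):
    """Transform a reliability coefficient: -ln(1 - coefficient) (assumed sign)."""
    return -math.log(1.0 - coefficient)


def abt_variance(n, n_items):
    """Approximate large-sample sampling variance on the transformed scale."""
    return 2.0 * n_items / ((n_items - 1) * (n - 2))


def abt_inverse(y):
    """Back-transform a value from the transformed scale to the 0-1 metric."""
    return 1.0 - math.exp(-y)


# Hypothetical primary study: alpha = 0.85 from n = 200 on the 9-item BPS
y_i = abt(0.85)
v_i = abt_variance(n=200, n_items=9)
print(round(y_i, 4), round(v_i, 5), round(abt_inverse(y_i), 2))
```

Pooling is then carried out on the transformed values and their variances, and only the final estimates and interval limits are passed through the inverse function.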
Random-effects models were fit using restricted maximum likelihood (REML) estimation to obtain pooled reliability estimates and between-study variance (τ^2^). Heterogeneity was assessed with Cochran’s Q, the I^2^ index (Higgins and Thompson, 2002), and the H^2^ statistic. Ninety-five percent prediction intervals were also calculated to describe the plausible range of reliability coefficients in future studies. Robustness was evaluated through leave-one-out influence diagnostics and examination of standardized residuals; in addition, normal Q–Q plots were inspected to assess distributional assumptions of model residuals (Viechtbauer, 2010).
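The quantities named in this paragraph can be sketched in a few lines. Note the substitution: the study used REML, which is iterative, whereas the sketch below uses the closed-form DerSimonian-Laird estimator of τ^2^ purely to keep the example short, so it will not reproduce the reported values. The I^2^ shown is the Q-based version, and the normal critical value 1.96 for the intervals is an assumption. The four effect sizes and variances are invented.

```python
import math


def random_effects_dl(y, v, z=1.96):
    """Random-effects pooling with a DerSimonian-Laird tau^2 (not REML)."""
    k = len(y)
    w = [1.0 / vi for vi in v]                      # fixed-effect weights
    sw = sum(w)
    mu_fixed = sum(wi * yi for wi, yi in zip(w, y)) / sw
    q = sum(wi * (yi - mu_fixed) ** 2 for wi, yi in zip(w, y))  # Cochran's Q
    df = k - 1
    c = sw - sum(wi ** 2 for wi in w) / sw
    tau2 = max(0.0, (q - df) / c)                   # truncated at zero
    w_re = [1.0 / (vi + tau2) for vi in v]          # random-effects weights
    mu = sum(wi * yi for wi, yi in zip(w_re, y)) / sum(w_re)
    se = math.sqrt(1.0 / sum(w_re))
    i2 = max(0.0, 100.0 * (q - df) / q) if q > 0 else 0.0
    half_pi = z * math.sqrt(tau2 + se ** 2)         # prediction interval half-width
    return {
        "mu": mu, "se": se, "tau2": tau2, "Q": q, "I2": i2,
        "ci": (mu - z * se, mu + z * se),
        "pi": (mu - half_pi, mu + half_pi),
    }


# Invented transformed coefficients and sampling variances
y = [1.6, 1.9, 2.3, 2.0]
v = [0.010, 0.020, 0.015, 0.010]
fit = random_effects_dl(y, v)
print({name: value for name, value in fit.items() if name in ("mu", "tau2", "I2")})
```

The prediction interval is wider than the confidence interval because it adds τ^2^ to the squared standard error of the pooled mean.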
Publication bias and small-study effects were examined using complementary methods. Funnel plots were inspected visually, accompanied by Egger’s regression test and Begg–Mazumdar rank correlation. Duval and Tweedie’s (2000) trim-and-fill procedure was applied as a sensitivity analysis, and precision-effect and precision-effect estimate with standard error (PET and PEESE) regressions were conducted (Stanley and Doucouliagos, 2014).
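To make the PET and PEESE idea concrete: both are weighted regressions of the effect size on a precision term (the standard error for PET, its square for PEESE), with the intercept read as the estimate expected from a hypothetical study with zero sampling error. The sketch below assumes inverse-variance weights; the exact specification used in the study is not stated, and the data are constructed for illustration only.

```python
def wls_line(x, y, w):
    """Weighted least squares fit of y = b0 + b1 * x; returns (b0, b1)."""
    sw = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


def pet_peese(effects, std_errors):
    """Return the PET and PEESE intercepts (bias-adjusted estimates)."""
    weights = [1.0 / se ** 2 for se in std_errors]   # assumed weighting
    pet_b0, _ = wls_line(std_errors, effects, weights)
    peese_b0, _ = wls_line([se ** 2 for se in std_errors], effects, weights)
    return pet_b0, peese_b0


# Constructed data: effects drift upward as standard errors grow
ses = [0.05, 0.10, 0.15, 0.20]
effects = [2.0 + 0.5 * se for se in ses]
pet, peese = pet_peese(effects, ses)
print(round(pet, 3), round(peese, 3))
```

The slope of the PET regression is closely related to Egger's test: a slope near zero indicates that less precise studies do not report systematically different effects.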
Reliability induction is a specific form of publication bias in RGMA, where the bias is introduced through the selective reporting or omission of reliability coefficients (López-Ibáñez et al., 2024). In the present study, of the 149 studies identified as using the BPS, 27 did not provide sample-specific reliability estimates: 14 did not report reliability coefficients, and 13 reported reliability values from prior studies and did not respond to author contact attempts. Accordingly, the reliability induction rate was calculated as 18.12%.
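The induction rate follows directly from the counts given in this paragraph; the short check below uses only those counts, and the variable names are illustrative.

```python
studies_using_bps = 149
no_coefficient_reported = 14       # reported no reliability coefficient
borrowed_from_prior_studies = 13   # cited reliability from earlier work

induced = no_coefficient_reported + borrowed_from_prior_studies
induction_rate = 100.0 * induced / studies_using_bps

print(induced, f"{induction_rate:.2f}%")
```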
To explore heterogeneity, moderator analyses were conducted at both continuous and categorical levels. Mixed-effects meta-regressions were performed with sample characteristics (mean age, SD of age, female ratio, sample size) and scale characteristics (mean and standard deviation of BPS scores). For categorical moderators, we pre-specified the subgroup levels and applied a common-τ^2^ framework. For α, we compared effects across four moderators: (i) Region (Asia and Europe; studies from other continents or with mixed/international samples were excluded for this moderator), (ii) Scale language (Chinese, English, and Others), (iii) Sample group [adolescent, general population, and university students (undergraduate and postgraduate students combined into this category)], and (iv) Publication type (article and thesis). For ω, subgroup analyses were limited to the sample group (general population and university students). Levels with missing information or fewer than two independent studies were excluded a priori from the relevant moderator analysis to ensure stable within-group estimates (k ≥ 2). Other categorical moderators were not analyzed for ω due to feasibility (sparse cells for Region, Scale language) or no variability (Publication type: all articles).
Region-based analyses were restricted to Asia and Europe, as these categories represented theoretically meaningful and internally coherent groupings with sufficient numbers of studies. Studies conducted in other regions (e.g., North America and Oceania) were not combined into an “other” category because they did not share a common geographical or cultural framework that would allow for a substantively interpretable comparison. For scale language, Chinese and English versions were examined separately due to their substantial representation and distinct measurement contexts. The remaining languages were grouped under “other,” reflecting adapted versions of the BPS with small individual sample sizes that did not permit separate analysis.
All subgroup models assumed a common between-study variance (τ^2^) across levels. We estimated τ^2^ via REML and then obtained level-specific pooled effects using fixed-effect estimation on augmented variances (vi* = vi + τ^2^), with 95% CIs reported on the coefficient scale after back-transformation. Between-group heterogeneity was evaluated with the analog ANOVA statistic Qbetween (df = number of groups − 1). Where the omnibus test was significant at the 0.05 level, we conducted pairwise contrasts between subgroup means using Wald tests on the transformed scale and applied a Bonferroni adjustment to control family-wise error. In addition, we reported the proportion of between-study variance explained (R^2^) for significant moderators as R^2^ = 1 − (τ^2^ residual / τ^2^ total), where τ^2^ residual is the between-study variance remaining once the moderator is in the model and τ^2^ total is the between-study variance without moderators, following Borenstein et al. (2009).
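A minimal sketch of the common-τ^2^ subgroup procedure is given below, assuming τ^2^ has already been estimated elsewhere and is passed in as a fixed number. The function names are illustrative, the data are invented, and the R^2^ helper uses the standard variance-ratio form with truncation at zero, which is assumed here.

```python
def subgroup_common_tau2(y, v, groups, tau2):
    """Pool within each level using weights 1 / (v_i + tau2).

    Returns per-level means, per-level standard errors, Q_between, and df.
    """
    w = [1.0 / (vi + tau2) for vi in v]              # augmented-variance weights
    grand = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
    sums = {}
    for wi, yi, g in zip(w, y, groups):
        sw, swy = sums.get(g, (0.0, 0.0))
        sums[g] = (sw + wi, swy + wi * yi)
    means = {g: swy / sw for g, (sw, swy) in sums.items()}
    ses = {g: (1.0 / sw) ** 0.5 for g, (sw, _) in sums.items()}
    q_between = sum(sw * (means[g] - grand) ** 2 for g, (sw, _) in sums.items())
    return means, ses, q_between, len(sums) - 1


def r_squared(tau2_residual, tau2_total):
    """Proportion of between-study variance explained, truncated at zero."""
    return max(0.0, 1.0 - tau2_residual / tau2_total)


def bonferroni(p_values):
    """Bonferroni-adjusted p-values for a family of pairwise contrasts."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Invented example: two levels with two studies each
y = [1.0, 1.2, 2.0, 2.2]
v = [0.1, 0.1, 0.1, 0.1]
groups = ["A", "A", "B", "B"]
means, ses, q_between, df = subgroup_common_tau2(y, v, groups, tau2=0.1)
print(means, round(q_between, 3), df)
```

Because every level shares the same τ^2^, differences between level means are tested against a single pooled estimate of residual heterogeneity rather than against noisy level-specific ones.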
As a sensitivity analysis, we additionally fitted separate random-effects models within each level of categorical moderators, allowing the between-study variance (τ^2^) to be estimated independently for each subgroup. These analyses revealed that τ^2^ values varied slightly across subgroup levels, indicating differences in residual heterogeneity. However, the pooled reliability estimates and their confidence intervals were highly consistent with those obtained under the common-τ^2^ specification, and the overall pattern of results remained unchanged. Accordingly, the main analyses assuming a common between-study variance are retained for presentation, with subgroup-specific τ^2^ estimates used to evaluate the robustness of the findings.
All analyses were carried out in R (R Core Team, 2024) using the metafor package (Viechtbauer, 2010). Forest plots, funnel plots, and diagnostic figures were generated with ggplot2 and base metafor functions. Supplementary materials include additional diagnostic plots (trim-and-fill funnels, PET/PEESE scatterplots, Q–Q plots) and subgroup tables.
Results
Study characteristics
A total of 122 studies (128 reliability coefficients) were included in the reliability generalization meta-analysis of the BPS. The frequencies of categorical moderators and descriptive statistics of continuous moderators are summarized in Table 1, while Table 2 presents a summary of the coded study characteristics and moderators used in the reliability generalization meta-analysis.
The majority were journal articles, with a smaller number of theses. Sample sizes varied widely, ranging from very small groups of fewer than 30 participants to large-scale studies with several thousand respondents. Studies represented diverse geographical regions and languages, including Chinese, English, Arabic, Turkish, German, Spanish, Portuguese, and others, reflecting broad international use of the BPS.
Participants encompassed a variety of groups, most commonly university students, but also adolescents, general population samples, and young adults. The average age across samples ranged from early adolescence to middle adulthood, with female participation rates differing substantially across studies.
Internal consistency estimates (Cronbach’s alpha and, where available, McDonald’s omega) showed considerable variability across studies. In several cases, more than one coefficient was reported within a single publication due to analyses conducted on multiple groups. When the same study tag appears with suffixes “a” and “b,” this denotes distinct groups within the same study; when tags include “_1” and “_2,” this indicates separate studies conducted by the same author in the same year. Empty cells reflect missing information in the original reports.
BPS alpha
Publication and reporting biases (α)
Publication bias was examined using multiple, complementary diagnostics based on Bonett’s ABT transformation of Cronbach’s alpha. Visual inspection of the funnel plot (Figure 2) shows the distribution of individual studies around the pooled estimate, plotted against the standard error, allowing assessment of potential small-study effects. Visual inspection was paired with formal tests, which did not indicate clear asymmetry: Egger’s regression was non-significant (z = 0.268, p = 0.789), and Begg–Mazumdar’s rank correlation was also non-significant (Kendall’s τ = 0.110, p = 0.068). As a sensitivity check, Duval and Tweedie’s trim-and-fill procedure imputed k₀ = 20 potentially missing studies and yielded a downward-adjusted pooled reliability of α = 0.836 [95% CI (0.822, 0.850)], compared with the original REML estimate of α = 0.855 [95% CI (0.843, 0.865)]; this corresponds to an absolute change of −0.019 (≈ −2.16%), suggesting that possible unpublished studies (or published studies that did not report an empirical reliability coefficient) would have only a modest impact on the pooled estimate.
Figure 2. Funnel plot of Cronbach’s alpha (Bonett-transformed) for the BPS.
To further probe small-study effects, PET and PEESE meta-regressions produced bias-adjusted intercepts of α = 0.8518 and α = 0.8515, respectively, which closely align with the original pooled estimate, suggesting minimal impact of small-study/publication bias on the central estimate. Taken together, the evidence is mixed. That is, trim-and-fill indicates possible missing, less precise studies that would slightly lower the pooled reliability, whereas Egger, Begg-Mazumdar, PET, and PEESE provide little support for material small-study bias. The pooled reliability estimate is therefore interpreted as robust, with all diagnostics reported for transparency (see Supplementary Figures S1–S3: trim-and-fill funnel, PET, and PEESE scatterplots).
Mean reliability and heterogeneity (α)
A random-effects meta-analysis was conducted on 127 independent samples (one study reported only omega) using Bonett’s ABT transformation of Cronbach’s alpha (REML estimator). The pooled effect on the ABT scale was 1.9286 [SE = 0.0388, z = 49.71, p < 0.0001; 95% CI (1.8525, 2.0046)], which back-transforms to a mean reliability of α = 0.8546 [95% CI (0.8432, 0.8653)]. Between-study heterogeneity was very large: τ^2^ = 0.1827 (SE = 0.0240), τ = 0.4274, I^2^ = 98.24%, H^2^ = 56.88; Cochran’s Q(126) = 11,892.77, p < 0.0001. The variance components (τ^2^ and τ) are reported on the transformed (ABT) scale. Reflecting this heterogeneity, the 95% prediction interval on the alpha scale was wide (0.6629–0.9373), indicating that future studies conducted under similar conditions may plausibly yield reliability estimates across this range. The dispersion of study-specific estimates is visualized in the forest plot (see Supplementary Figure S4), which illustrates both the concentration of effects around the pooled mean and the presence of studies with lower and higher reliability.
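The back-transformed values in this paragraph can be checked by hand from the reported transformed-scale statistics. The sketch below assumes the inverse transformation 1 − exp(−y), a normal critical value of 1.96, a prediction interval of the form estimate ± 1.96·√(τ^2^ + SE^2^), and the relation I^2^ = 1 − 1/H^2^; none of these is stated explicitly in the text, but together they reproduce the reported figures to the precision shown.

```python
import math


def back_transform(y):
    """Map a transformed-scale value to the alpha metric (assumed inverse)."""
    return 1.0 - math.exp(-y)


# Values reported for Cronbach's alpha on the transformed scale
mu, se, tau2, h2 = 1.9286, 0.0388, 0.1827, 56.88
z = 1.96  # assumed normal critical value

pooled_alpha = back_transform(mu)
ci = (back_transform(mu - z * se), back_transform(mu + z * se))

half_width = z * math.sqrt(tau2 + se ** 2)
pi = (back_transform(mu - half_width), back_transform(mu + half_width))

i2 = 100.0 * (1.0 - 1.0 / h2)

print(round(pooled_alpha, 4), [round(x, 4) for x in ci])
print([round(x, 4) for x in pi], round(i2, 2))
```

The same steps applied to the omega results (estimate 2.019, SE 0.116, τ^2^ = 0.140) give a pooled value and prediction interval that agree with those reported for ω to three decimals.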
Leave-one-out diagnostics did not reveal undue influence by any single study. Across the most influential omissions identified, the back-transformed pooled α remained tightly bounded (approximately 0.852–0.856), while heterogeneity indices stayed high (e.g., I^2^ ≈ 97.7–98.2%; τ^2^ ≈ 0.146–0.181). A Q–Q plot of standardized residuals (see Supplementary Figure S5) further indicated approximate normality, with most studies following the theoretical quantile line reasonably well and only modest deviations in the tails. This pattern supports the robustness of the central estimate while reflecting the very high heterogeneity observed across studies. Taken together, the central estimate of reliability is stable, but the magnitude of heterogeneity suggests that moderator analyses are warranted to explain systematic variability across studies.
Meta regressions for continuous moderator variables (α)
Mixed-effects meta-regressions were conducted to examine whether sample characteristics and scale scores accounted for heterogeneity in Cronbach’s alpha coefficients of the BPS (Table 3). Mean age was positively associated with reliability estimates, b = 0.016, 95% CI [0.005, 0.027], p = 0.005, explaining 6.87% of the heterogeneity. Similarly, age variability (SD of age) was a significant positive predictor, b = 0.030, 95% CI [0.012, 0.048], p = 0.001, accounting for 8.75% of the heterogeneity (see Supplementary Figures S6, S7). That is, studies with older samples and greater age variability tend to report higher reliability estimates for the BPS scores. By contrast, the proportion of women in the sample was unrelated to reliability (p = 0.935). With respect to sample size, the raw N specification showed a marginal trend, b ≈ 0.000, 95% CI [−0.000, 0.000], p = 0.072, explaining 2.07% of heterogeneity, whereas the log-transformed N was clearly non-significant (p = 0.619). For scale score moderators, neither the mean BPS score (b = −0.200, 95% CI [−0.470, 0.070], p = 0.144) nor the SD of BPS scores (b = 0.197, 95% CI [−0.231, 0.625], p = 0.365) significantly predicted reliability. Both explained negligible portions of heterogeneity (≤1.3%). Overall, while older average age and greater age variability of participants were associated with higher reliability, these effects were small. Moreover, the persistence of very high I^2^ values should be interpreted cautiously, as I^2^ is a relative measure of heterogeneity and may remain inflated in meta-analyses with generally large sample sizes and very small sampling variances, even when the inclusion of moderators leads to only modest reductions in the true between-study variance (τ^2^).
Subgroup analyses for categorical moderator variables (α)
Subgroup analyses (Table 4) using a common-τ^2^ model showed a significant difference by Region [Q(1) = 5.776, p = 0.016], with Europe exhibiting higher reliability (α ≈ 0.877) than Asia (α ≈ 0.845). That is, studies conducted in Europe tend to report more reliable BPS scores than studies conducted in Asia. Scale language showed no differences [Q(2) = 0.539, p = 0.764]. Sample group was significant [Q(2) = 10.742, p = 0.005]; pairwise tests (Bonferroni) indicated that the general population had higher reliability than university students (α ≈ 0.882 vs. 0.849; p_adj = 0.016) and adolescents (α ≈ 0.882 vs. 0.825; p_adj = 0.018), whereas adolescents and university students did not differ (α ≈ 0.825 vs. 0.849; p_adj = 0.760) (see Supplementary Table S1). In other words, BPS scores appear to be more reliable in studies based on general population samples than in studies focusing on university students or adolescents. Publication type showed no difference [Q(1) = 0.421, p = 0.516]. The moderators explained ≈4.1% (Region) and ≈7.5% (Sample group) of between-study variance (R^2^; see Supplementary Table S2).
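The p-values attached to the between-group Q statistics can be verified without statistical software, because the chi-square survival function has a closed form for 1 and 2 degrees of freedom. The sketch below covers only those two cases, which are the ones used in this section; the function name is illustrative.

```python
import math


def chi2_sf(q, df):
    """Upper-tail probability of a chi-square variate for df = 1 or df = 2."""
    if df == 1:
        return math.erfc(math.sqrt(q / 2.0))
    if df == 2:
        return math.exp(-q / 2.0)
    raise ValueError("closed form implemented only for df = 1 or df = 2")


# Q statistics reported for the alpha subgroup analyses
reported = {
    "Region": (5.776, 1),
    "Scale language": (0.539, 2),
    "Sample group": (10.742, 2),
    "Publication type": (0.421, 1),
}
for name, (q, df) in reported.items():
    print(f"{name}: Q({df}) = {q}, p = {chi2_sf(q, df):.3f}")
```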
BPS omega
Publication and reporting biases (ω)
Publication bias was assessed using multiple, complementary diagnostics on the Bonett-transformed scale. Formal tests did not indicate asymmetry: Egger’s regression was non-significant (z = 0.0496, p = 0.9604), and Begg–Mazumdar’s rank correlation was also non-significant (Kendall’s τ = −0.0561, p = 0.8137). Visual inspection of the funnel plot suggested symmetry (Figure 3).
Figure 3. Funnel plot of McDonald’s omega (Bonett-transformed) for the BPS.
As a sensitivity check, Duval and Tweedie’s trim-and-fill procedure imputed k₀ = 0 studies and left the pooled estimate unchanged at ω = 0.867 [95% CI (0.833, 0.894)]. PET and PEESE meta-regressions yielded bias-adjusted intercepts close to the pooled estimate (PET ω = 0.8625; PEESE ω = 0.8561). Taken together, these indicators provide little evidence of material small-study/publication bias, and the central reliability estimate appears robust (see Supplementary Figures S8–S10: trim-and-fill funnel, PET, and PEESE scatterplots).
Mean reliability and heterogeneity (ω)
A random-effects meta-analysis on k = 11 independent samples yielded a pooled effect of 2.019 on the ABT transformed scale [SE = 0.116, z = 17.48, p < 0.0001; 95% CI (1.793, 2.246)]. Back-transformed to McDonald’s omega, the mean reliability was ω = 0.867 with 95% CI [0.834, 0.894]. Between-study heterogeneity was substantial: τ^2^ = 0.140 (SE = 0.066), τ = 0.375, I^2^ = 96.07%, H^2^ = 25.41; Cochran’s Q(10) = 262.60, p < 0.0001. Consistent with this dispersion, the 95% prediction interval on the omega scale was [0.714, 0.938], indicating that future studies conducted under similar conditions may plausibly yield reliability estimates across this range. Study-level estimates and their confidence intervals are displayed in the forest plot (Figure 4). Individual study estimates with 95% confidence intervals are shown as squares and horizontal lines, respectively. The size of the square reflects the study’s weight, and the diamond represents the pooled reliability estimate with its confidence interval.
Figure 4. Forest plot of McDonald’s ω for the BPS.
Leave-one-out analyses did not indicate undue influence by any single study. Across the most influential omissions, heterogeneity remained high (I^2^ ≈ 94.56–96.46%; τ^2^ ≈ 0.101–0.152), and the pooled effect on the transformed scale remained within a narrow band, implying a stable central estimate despite notable between-study variability. A Q–Q plot of standardized residuals (see Supplementary Figure S11) further indicated approximate normality, with most studies following the theoretical quantile line and only modest deviations at the extremes. This supports the robustness of the central estimate while highlighting the very high heterogeneity observed across studies.
Meta regressions for continuous moderator variables (ω)
Mixed-effects meta-regressions were conducted to examine whether sample characteristics and scale scores accounted for heterogeneity in McDonald’s ω coefficients of the BPS (Table 5). None of the continuous moderators reached statistical significance (all p ≥ 0.086). For age-related moderators, mean age showed a non-significant positive association, b = 0.020, 95% CI [−0.005, 0.045], p = 0.099, and age variability (SD of age) was likewise non-significant, b = 0.039, 95% CI [−0.007, 0.085], p = 0.086. The proportion of women was unrelated to reliability, b = 0.004, 95% CI [−0.011, 0.020], p = 0.546. Sample size did not predict ω either in raw units [b ≈ 0.000, 95% CI (−0.001, 0.002), p = 0.656] or on the log scale [b = 0.042, 95% CI (−0.724, 0.808), p = 0.903]. For scale-score moderators, mean BPS [b = −0.612, 95% CI (−1.587, 0.363), p = 0.186] and SD of BPS [b = −0.738, 95% CI (−2.385, 0.909), p = 0.332] were also non-significant. Overall, residual heterogeneity remained very high (I^2^ ≈ 95–96%), indicating that most between-study variability in ω was not explained by these moderators with the available k (10–11 studies per model).
Subgroup analyses for categorical moderator variables (ω)
Subgroup analyses using a common-τ^2^ model compared university students and the general population (Table 6). The between-groups test was significant, Q(1) = 5.445, p = 0.020, with higher reliability in general population samples [ω = 0.894, 95% CI (0.863, 0.917); k = 6] than in university student samples [ω = 0.828, 95% CI (0.766, 0.874); k = 4]. The sample-group moderator explained ≈34.7% of the between-study variance (R^2^; see Supplementary Table S3). Other categorical moderators were not analyzed for ω due to feasibility (sparse cells for Region, Scale language) or no variability (Publication type: all articles).
Discussion
Reliability is crucial in psychological assessment because it ensures the consistency and accuracy of the data collected. Unreliable data can compromise the validity of research findings and lead to incorrect conclusions. Using the REGEMA framework, the present study aimed to evaluate the reliability of the Bedtime Procrastination Scale (BPS) across diverse cultural, linguistic, and sample characteristics. Accordingly, reliability generalization meta-analyses were conducted to estimate the pooled reliability of the BPS using two internal consistency coefficients—Cronbach’s alpha and McDonald’s omega—and to investigate potential moderator variables that may account for variability in reliability estimates across individual studies. The results indicated that the pooled reliability estimates were 0.855 for Cronbach’s alpha and 0.867 for McDonald’s omega. It should be noted that the pooled McDonald’s omega estimate was based on a smaller number of studies. The pooled reliability estimates were higher than the commonly accepted threshold of 0.70 (Cohen et al., 2022; George and Mallery, 2020; Nunnally and Bernstein, 1994). While this cut-off is considered sufficient for studies focusing on predictive or construct validity (Nunnally and Bernstein, 1994), higher thresholds of 0.90 or 0.95 are recommended in contexts involving high risk or critical decision-making (Cohen et al., 2022; Nunnally and Bernstein, 1994). From a construct validity perspective, the pooled reliability estimates obtained in this study can therefore be considered acceptable.
In addition, 95% prediction intervals were estimated for both Cronbach’s alpha and McDonald’s omega (for Cronbach’s alpha: 0.6629–0.9373; for McDonald’s omega: 0.714–0.938). Prediction intervals provide an estimate of the range within which reliability coefficients of future studies are expected to fall (Higgins et al., 2009; IntHout et al., 2016). The relatively wide prediction intervals observed in this study indicate that caution is warranted, particularly in situations involving high-stakes or critical decisions.
Identifying the sources that affect the homogeneity of reliability is another key aim of the study. After estimating pooled reliability, the homogeneity of reliability coefficients was assessed using Cochran’s Q, the I^2^ index (Higgins and Thompson, 2002), and the H^2^ statistic. The results showed a significant degree of variability between studies (inter-study heterogeneity) for both Cronbach’s alpha and McDonald’s omega. Moderator analyses revealed that several characteristics were statistically significant predictors of variability in reliability estimates. However, the heterogeneity explained by these moderators was generally low. This may be partly due to the large number of studies included in the analysis, whereby even small differences can become statistically significant (Borenstein et al., 2009). This pattern also highlights the need to interpret statistical significance carefully and not to equate it with practical importance.
In this study, mean age, standard deviation of age, proportion of female participants, sample size, mean BPS scores, and standard deviation of BPS scores were included as continuous moderators, and analyses were conducted to account for the observed heterogeneity in Cronbach’s alpha and McDonald’s omega coefficients. Meta-regression results for Cronbach’s alpha revealed that both mean age and age variability (standard deviation of age) were significantly associated with reliability, indicating that higher average age and greater dispersion in participant ages corresponded to increased Cronbach’s alpha values. Supporting this perspective, Schipke and Freund (2012) reported that samples consisting solely of adults or solely of older individuals negatively affected reliability, whereas larger sample sizes tended to have a positive influence. By contrast, no significant associations were observed for female proportion, sample size, mean BPS score, or standard deviation of BPS score, which showed that gender distribution, sample size, and the mean and spread of scale scores were not statistically significant predictors of heterogeneity in reliability. The non-significant effect of the female ratio on reliability coefficients indicates that similar levels of reliability have been achieved regardless of the gender composition of the sample. Indeed, Franco-Jimenez (2024) examined the measurement invariance of the Bedtime Procrastination Scale by gender and found that invariance held across genders. This finding provides evidence for the construct validity of the scale across genders. The finding that reliability does not differ by gender is therefore consistent with previous psychometric evidence regarding construct validity. Sample size was also a non-significant moderator. The content of the BPS may facilitate stable reliability estimates, reducing the need for large samples to achieve adequate measurement precision. Notably, in the original development study of the instrument, data collected from 177 participants already provided sufficient evidence for the psychometric properties. For McDonald’s omega coefficients, however, none of the continuous moderator variables, including mean age and age variability, were found to be significant. This may be attributable to the relatively small number of studies reporting omega, which reduced the power of the moderator significance tests, and to the more limited variability in participant age across these samples. Indeed, the studies included in the omega analyses predominantly encompassed participants with a narrower age range compared to those in the Cronbach’s alpha analyses.
Subgroup analyses of Cronbach’s alpha for categorical moderator variables revealed significant effects for region and sample group, whereas scale language and publication type were not significant predictors of variability in reliability coefficients. With respect to region, studies conducted in Europe yielded higher reliability estimates compared to those conducted in Asia. For the categorical moderator variable of sample group, which distinguished between adolescents, university students, and the general population, results indicated that the general population group demonstrated significantly higher reliability compared to both adolescents and university students. However, no significant differences were observed between the adolescent and university student samples. Bruna et al. (2018) associated the high heterogeneity between samples with the variability of reliability. The non-significance of scale language can be explained by the psychometric properties of the BPS. The instrument is a brief, unidimensional measure consisting of nine items that are conceptually straightforward and terminologically simple. These features likely facilitate practical adaptation across languages while maintaining the integrity of the factor structure, thereby preserving response consistency across translations. Moreover, although the original validation study was conducted with a relatively small sample, construct validity was nonetheless supported, further suggesting that the scale’s structure is robust and easily replicable across different linguistic and cultural contexts. For McDonald’s omega, categorical moderator analyses were limited to sample type (university students and the general population). The results paralleled those of Cronbach’s alpha, indicating that measurements obtained from general population samples demonstrated higher reliability than those from university student samples.
A relatively large number of moderator analyses were conducted to explore potential sources of heterogeneity. As noted in the meta-analytic literature, although multiple testing can increase the risk of Type I error, there is no consensus on how this issue should be handled in subgroup analyses or meta-regression (Borenstein et al., 2009). Accordingly, rather than applying a uniform multiplicity correction across all tests, moderator results were interpreted cautiously and in context, with emphasis placed on the consistency, direction, and theoretical plausibility of effects rather than on isolated p-values. This approach is consistent with recommendations for exploratory moderator analyses in meta-analysis.
The findings indicate that the BPS demonstrates consistently high internal consistency across both Cronbach’s alpha and McDonald’s omega coefficients, providing convergent evidence for the scale’s reliability. Nonetheless, while reliability was sufficient for the majority of research objectives, greater caution is recommended in high-stakes or high-risk contexts where measurement precision is crucial. Reliability is a necessary but not sufficient condition for accurate measurement: to interpret the data meaningfully and make practical decisions, evidence of validity is also needed, and the present RGMA did not address this aspect. The moderator analyses underscore the need for careful interpretation, suggesting that greater emphasis should be placed on the consistency and theoretical plausibility of observed patterns than on isolated significance tests. The results obtained for alpha and omega are largely compatible, thereby reinforcing confidence in the internal consistency of the BPS. At the same time, they highlight broader measurement considerations that extend beyond reliability alone.
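One way to see why precision still matters at this level of reliability is the standard error of measurement from classical test theory, SEM = SD × sqrt(1 − reliability). The sketch below applies it to the pooled Cronbach’s alpha of 0.855 reported in this study. The standard deviation of BPS total scores is a hypothetical value chosen purely for illustration, and the calculation is not part of the original analysis.

```python
import math

def standard_error_of_measurement(score_sd: float, reliability: float) -> float:
    """Classical test theory SEM: SD * sqrt(1 - reliability)."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie in [0, 1]")
    return score_sd * math.sqrt(1.0 - reliability)

POOLED_ALPHA = 0.855   # pooled Cronbach's alpha reported in this study
HYPOTHETICAL_SD = 7.0  # assumed SD of BPS total scores, NOT from the study

sem = standard_error_of_measurement(HYPOTHETICAL_SD, POOLED_ALPHA)
half_width_95 = 1.96 * sem  # approximate 95% band around an observed score
print(f"SEM = {sem:.2f} points, 95% band = +/- {half_width_95:.2f} points")
```

The width of the resulting band around an individual score indicates how much room for error remains when the scale is used for decisions about individuals rather than for group-level research.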
Limitations
In the present study, pooled reliability estimates were calculated, and subsequent analyses were conducted based on all reliability coefficients, regardless of the specific research context. The intended purpose of the instrument within each study, such as providing evidence for construct validity versus supporting decision-making about individuals, was therefore not taken into account. Because the BPS has been linked to various health outcomes, including insomnia, obesity, and diabetes, future RGMAs of the BPS should consider the specific contexts and applications of the scale. Conducting subgroup analyses based on these contextual distinctions may provide more nuanced insights into the reliability of the instrument in relation to its uses in different fields, including psychological and clinical settings.
Implications
The findings of this study have important implications for both research and applied contexts. Although pooled reliability estimates were examined across all studies regardless of their specific context, the results highlight that certain moderators, such as age, region, and sample type, can meaningfully influence reliability outcomes. This suggests that researchers and practitioners should consider these factors when interpreting BPS scores, particularly in high-stakes or clinical contexts where decisions may have significant consequences.
With regard to prospective research, the non-significant moderators in this meta-analysis, including publication type, language, sample size, mean and standard deviation of BPS scores, and percentage of women, imply that these factors may have a negligible influence on reliability. Nevertheless, future studies could further explore their potential effects. Investigating the BPS in novel cultural, clinical, or applied settings may help identify conditions under which these variables could become more relevant. This would ultimately improve the generalizability and interpretability of the scale across diverse populations.
Conclusion
This meta-analysis aimed to comprehensively review the reliability of Bedtime Procrastination Scale (BPS) scores across diverse cultural and linguistic samples, considering methodological characteristics. The study demonstrated that the BPS exhibits strong internal consistency overall, including across gender, different versions of the scale, publication types, and sample sizes. Future studies should continue to examine the BPS in different settings, such as clinical and medical decision-making and large-scale empirical research, in order to improve the accuracy and practical usefulness of the tool across a range of research and applied situations.
References
- 1. Aguayo-Estremera R. Vargas-Pecino C. de la Fuente Solana E. I. Lozano Fernández L. M. (2011). A meta-analytic reliability generalization study of the Maslach burnout inventory. Int. J. Clin. Health Psychol. 11, 343–361.
- 2. Åkerstedt T. Nilsson P. M. (2003). Sleep as restitution: an introduction. J. Intern. Med. 254, 6–12. doi: 10.1046/j.1365-2796.2003.01195.x, PMID: 12823638
- 3. Ali B. T. A. Saleh N. O. Mreydem H. W. Hammoudi S. F. Lee T. Chung S. et al. (2021). Screen time effect on insomnia, depression, or anxiety symptoms and physical activity of school students during COVID-19 lockdown in Lebanon: a cross sectional study. Sleep Med. Res. 12, 101–109. doi: 10.17241/smr.2021.01109
- 4. Alshammari T. K. Rogowska A. M. Basharahil R. F. Alomar S. F. Alseraye S. S. Al Juffali L. A. et al. (2023). Examining bedtime procrastination, study engagement, and studyholism in undergraduate students, and their association with insomnia. Front. Psychol. 13:1111038. doi: 10.3389/fpsyg.2022.1111038, PMID: 36733877, PMCID: PMC9886684
- 5. American Psychological Association (2020). Publication manual of the American Psychological Association. 7th Edn. Washington DC: American Psychological Association.
- 6. An H. Chung S.-J. Suh S. (2019). Validation of the Korean version of the Bedtime Procrastination Scale in young adults. J. Sleep Med. 16, 41–47. doi: 10.13078/jsm.19030
- 7. An Y. Zhang M. X. (2024). Relationship between problematic smartphone use and sleep problems: the roles of sleep-related compensatory health beliefs and bedtime procrastination. Digit. Health 10. doi: 10.1177/20552076241283338, PMID: 39291154, PMCID: PMC11406640
- 8. Andrews J. L. Lokesh L. (2024). The relationship between bedtime procrastination and emotional intelligence among college students. Annals of the Bhandarkar Oriental Research Institute 2, 133–141.
