Behavioral engagement patterns and psychosocial outcomes in web-based interpretation bias training for anxiety
Jeremy William Eberle, Sonia Baee, Emma Catherine Wolfe, Mehdi Boukhechba, Daniel Harold Funk, Bethany Ann Teachman, Laura Elizabeth Barnes, Haleh Ayatollahi, Haleh Ayatollahi

TL;DR
This study examines how different engagement patterns in a web-based anxiety program affect outcomes, finding that spending more time does not always lead to better results.
Contribution
The study reveals that time spent on a digital mental health program does not consistently correlate with better outcomes, suggesting the need for broader engagement metrics.
Findings
Participants who spent less time on the program improved more in negative interpretation bias during training.
Those who spent less time experienced a significant loss in training gains by follow-up.
Time spent is an incomplete marker of engagement and should be supplemented with cognitive and affective measures.
Abstract
Digital mental health interventions (DMHIs) have the potential to expand treatment access for anxiety but often have low user engagement. The present study analyzed differences in psychosocial outcomes for different behavioral engagement patterns in a free web-based cognitive bias modification for interpretation (CBM-I) program. CBM-I is designed to shift interpretation biases common in anxiety by providing practice thinking about emotionally ambiguous situations in less threatening ways. Using data from 697 anxious community adults undergoing five weekly sessions of CBM-I in a clinical trial, we extracted program use markers based on task completion rate and time spent on training and assessment tasks. After using an exploratory cluster analysis of these markers to create two engagement groups (whose patterns ended up reflecting generally more vs. less time spent across tasks), we used…
Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.
Click any figure to enlarge with its caption.
Fig 1
Fig 2- —http://dx.doi.org/10.13039/100000025National Institute of Mental Health
- —http://dx.doi.org/10.13039/100000025National Institute of Mental Health
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMental Health Research Topics · Digital Mental Health Interventions · Anxiety, Depression, Psychometrics, Treatment, Cognitive Processes
Background and significance
The prevalence of clinically significant anxiety has grown worldwide, leading to increased demand for mental health services [1]. However, multiple barriers, such as limited availability and high cost of in-person therapy, prevent individuals from receiving adequate services [2]. Digital mental health interventions (DMHIs) have the potential to reduce these barriers and increase access to care [3–5]. For example, web-based cognitive bias modification programs that target interpretation biases (CBM-I) associated with anxiety [6,7] can shift the tendency of anxious people to assign threatening meanings to ambiguous situations by providing practice resolving ambiguous scenarios positively, without requiring a therapist [8].
Although DMHIs for anxiety can yield small-to-moderate reductions in symptoms, high rates of attrition and low user engagement may limit the efficacy of these interventions and challenge their widespread adoption [9–12]. These challenges also apply to CBM-I, which can improve anxiety symptoms but has yielded somewhat mixed findings [13] and high attrition rates in clinical trials [14,15]. Understanding the impact of users’ engagement patterns on their CBM-I outcomes may shed light on these mixed findings, clarify dose-response relationships, identify different patterns of effective engagement and mechanisms of change, and guide efforts to tailor CBM-I to users’ needs [16–20].
Engagement in DMHIs is a multifaceted construct with at least two parts: (a) extent of program use (i.e., behavioral engagement), and (b) the user’s subjective experience throughout the program [21], which involves both cognitive and affective aspects [22]. Most research has analyzed extent of use, which has been assessed with behavioral metrics (e.g., time spent online, completion rate, frequency of use) from questionnaires, ecological momentary assessments, sensors, or system logs [18,20]. Greater use on a given metric is typically expected to yield better outcomes [23], but mixed findings and methodological differences across studies [24] highlight the need for further research analyzing engagement as a more complex construct. In particular, examining correlates of use, such as symptom severity and demographic characteristics, may help reveal the meaning of users’ behavioral engagement (e.g., lack of time spent on a program likely has a different meaning when a user is higher vs. lower in symptom severity), providing a fuller picture of engagement patterns.
The dominant approach to studying the effect of engagement on outcomes in DMHIs has been to analyze associations between individual engagement metrics (or composites of metrics) and outcomes [25–30]. Another approach, which we take in the present paper, has been to group users by multiple engagement metrics (e.g., using data-driven cluster analysis) and analyze whether these groups show differential symptom reduction [16,17,19,30–32]. The former approach is variable centered, focused on average relations between certain variables and outcomes in a single population. By contrast, the latter approach is person centered, focused on specific configurations of values across a set of variables, possibly but not necessarily representing multiple subpopulations, and on the effects of these configurations on outcomes [33,34].
Using cluster analysis to create engagement subgroups is promising when there are multiple possible engagement features because cluster analysis’s use of multiple (vs. single or composite) metrics and its exploratory nature can suggest clusters that differ based on potentially more complex engagement patterns in the data than can be represented by single metrics or predicted a priori [16,17,19,35]. For example, exploratory classification methods have identified Low Engagers, Late Engagers, High Engagers With Rapid Disengagement, High Engagers With Moderate Decrease (in Engagement), and Highest Engagers, with certain group differences in symptom outcomes [16]. This person-centered approach may be especially useful for studying real-world engagement in implementation settings, where users’ varied expectations, goals, needs, and resources may yield more complex engagement patterns than those in efficacy trials [19].
Objective
The present study analyzed whether behavioral engagement subgroups, created using an exploratory cluster analysis of program use markers (i.e., task completion rate, time spent on training and assessment tasks), show differential improvement in anxiety and interpretation bias in a trial of web-based CBM-I administered to anxious adults on our team’s public research website MindTrails. Although we hypothesized that the group(s) with metrics suggesting higher engagement would have significantly better outcomes in general, we did not have hypotheses about the specific (and potentially complex) engagement patterns that the cluster analysis might suggest. After the cluster analysis suggested two groups differing on time spent across tasks, we expected the group that generally spent more time doing the program (including both training and assessment tasks) would improve more (based on the purported positive relation between extent of use and outcomes [23], and on the positive relation in a prior MindTrails study between mean time spent per CBM-I scenario and anxiety improvement) [28]. However, we also recognized plausible alternatives (e.g., taking longer can indicate distraction) [24] and ran post hoc comparisons of the groups on various baseline variables (e.g., demographics) to more fully characterize them. (Given that another post hoc test indicated that the existence of two groups was not significantly more plausible than the existence of one group, we consider the two groups descriptive subgroups of a single population rather than two truly distinct subpopulations [36].)
Materials and methods
Participants and procedure
We analyzed data from 697 community adults with at least moderate anxiety symptoms ( on an adapted version of the Anxiety Subscale of the Depression, Anxiety, Stress Scales-Short Form [37]; DASS-21 AS; i.e., on DASS-42 AS) who had been randomly assigned to the CBM-I condition of the Calm Thinking study [15], a multistage randomized trial (ClinicalTrials.gov ID: NCT03498651) run on the MindTrails website (https://mindtrails.virginia.edu). We focused on participants who started the first of five weekly training sessions. Out of the 984 who started the first session, we excluded a subgroup who were later randomly assigned to receive supplemental telecoaching as part of the parent study (n = 282; see Section A.1 in S1 File), those with repeated eligibility screenings (n = 3), and those identified as outliers in most of the engagement markers (n = 2). For a flow diagram, see Fig 1. (Starting training was defined as viewing instructions on the page before the first CBM-I scenario. Given that 70 of the 697 participants stopped at the instructions page and did not complete any scenarios, as a sensitivity analysis we reran analyses excluding these users [29]. Results were nearly identical [see Tables B.5-B.7 and Fig C.5 in S1 File], so we focus on the original sample.)
Consort diagram.Classification Note. The parent Calm Thinking study attempted to use an algorithm to classify CBM-I participants as lower- vs. higher- risk of dropout so that higher-risk participants could be randomized to either receive supplemental telecoaching or continue with only CBM-I. However, the accuracy of the algorithm was not better than chance, suggesting that the risk classification does not reflect meaningful differences in risk. Thus, the “Lower risk (LR) for dropout” and “Higher risk (HR) for dropout” participants have comparable dropout risk, and the present study’s exclusion of “HR CBM-I + Coaching” participants is an exclusion of participants assigned to an additional intervention, not an exclusion of participants at meaningfully higher risk. For details, see Calm Thinking main outcomes paper. Note. Adapted from the Calm Thinking main outcomes paper CONSORT diagram. Analysis exclusions are included in flow but not analyzed. S1-5 = Session 1-5; FU = Follow-Up. aneeded for stratification. banalyzed sample before exclusions. ccondition classification count before exclusions. d1 did not start S1 training; 1 started S1 training but did not complete classif. measures.
Enrollment opened on March 18, 2019, and closed on April 6, 2020; we analyzed data collected through November 27, 2020. After providing informed consent (by clicking a button indicating that they have read the consent form and agree to participate), participants were asked to complete assessments at pretreatment, after each training session, and at 2-month follow-up. At least 5 days had to pass between finishing one session’s assessment and starting the next session’s training; a mean of 9.93-11.53 days elapsed between consecutive training sessions (see Table B.11 in S1 File). At least 60 days had to pass between completing Session 5’s assessment and starting the follow-up assessment. Participants received 10 after completing the follow-up assessment. The University of Virginia Institutional Review Board for the Social and Behavioral Sciences approved all procedures. For more details about this trial, see the main outcomes paper [15].
Cognitive bias modification for interpretation (CBM-I)
Each CBM-I session, designed to take about 15 min, consisted of a unique set of 40 ambiguous scenarios about everyday situations (e.g., social settings, physical sensations, health concerns) that could prompt anxiety. After seeing a scenario’s title (e.g., “Spotting a Neighbor") and image, participants read the description, which ended with a word fragment that resolved the ambiguity as positive 90% of the time and negative otherwise (e.g., “As you are walking down a crowded street, you see your neighbor on the other side. You call out, but they do not answer you. Standing there in the street, you think that this must be because they were distr_cted" [positive]). Participants selected one or more letters to complete the word fragment, which increased in difficulty across the five sessions. Participants then answered a comprehension question that reinforced the scenario’s meaning and whose format varied. (For details, see Section A.2 in S1 File.)
Outcome measures and covariate
Anxiety symptoms.
Overall Anxiety Severity and Impairment Scale (OASIS).
The Overall Anxiety Severity and Impairment Scale (OASIS), our primary measure of anxiety, is a five-item self-report of the frequency and severity of anxiety and avoidance and the degree of related impairment in the past week (adapted from the original version [38]; see Section A.3 in S1 File for measure details and adaptations). Participants rate each item on a 5-point scale from 0 (lowest frequency/severity) to 4 (highest frequency/severity). The OASIS was assessed at baseline (pretreatment), after Sessions 1-5, and at follow-up. Internal consistency for the analyzed sample using complete item-level data at baseline was good, Cronbach’s .
Depression, Anxiety, Stress Scales-Short Form: Anxiety Subscale (DASS-21 AS).
The DASS-21 AS, our secondary measure of anxiety, is a seven-item self-report of the frequency of anxiety symptoms (especially physical symptoms; adapted from the original version) [37]. Participants rate the extent each item applied to them in the past week on a 4-point scale from 0 (not at all) to 3 (most of the time). The DASS-21 AS was assessed at baseline (screening), after Sessions 3 and 5, and at follow-up. Internal consistency at baseline was acceptable, .
Interpretation bias.
Recognition Ratings (RR).
Recognition Ratings (RR, developed by our team to resemble those in the original version [39]) is our primary measure of interpretation bias. Participants read nine ambiguous scenarios and then for each scenario rate how similar each of four possible interpretations is to the scenario on a 4-point scale from 1 (very different) to 4 (very similar). For details, see Section A.4 in S1 File. Negative and positive bias scores were calculated by averaging ratings of negative and positive threat-relevant interpretations, respectively (per a prior paper) [14]. RR was assessed at baseline (pretreatment), after Sessions 3 and 5, and at follow-up. Internal consistency at baseline was acceptable for both scores, s = .74 and .73, respectively.
Brief Body Sensations Interpretations Questionnaire (BBSIQ).
The Brief Body Sensations Interpretations Questionnaire (BBSIQ, modified from the original version [40]) is our secondary measure of interpretation bias. Participants read 14 ambiguous situations and rate the likelihood of three explanations using a 5-point scale from 0 (not at all likely) to 4 (extremely likely). For details, see Section A.5 in S1 File. A negative bias score was calculated by averaging the ratings for all negative, threat-relevant explanations [14]. The BBSIQ was assessed at baseline (pretreatment), after Sessions 3 and 5, and at follow-up. Internal consistency at baseline was excellent, .
Training confidence.
Item from Readiness Ruler.
In the multilevel model below, we included one item that assessed confidence in online anxiety programs from the Readiness Ruler [41] at baseline as a covariate given that higher confidence was associated with lower dropout and greater improvement in a prior MindTrails study [42]. Participants rated "How confident are you that an online training program will reduce your anxiety?" on a scale from 0 (not at all) to 4 (very).
Engagement markers
We defined program use markers for a cluster analysis (see below) based on task completion rate and time spent on training and assessment tasks. Some prior studies have also treated assessment tasks as relevant to engagement, using rates of completing assessment tasks as engagement markers [18] or including time spent on assessment tasks in predictors of dropout [43] or treatment outcomes [44,45]. We viewed assessment tasks as relevant to engagement in the program for a few reasons. First, interventions can include assessments of their targets and outcomes solely to improve efficacy by providing progress feedback [46] (which users were told they would receive for their OASIS scores at the end of the parent trial) or by eliciting desirable reactive effects from the assessment task itself, even without feedback [47]. Second, the training and assessment tasks, some of which had similar formats (e.g., the CBM-I training task and RR and BBSIQ assessment tasks all present ambiguous scenarios), were slightly intermixed and presented seamlessly at training sessions; thus, users may not have viewed all training and assessment tasks as distinct. Third, interventions can include assessments to adapt treatment based on certain tailoring variables [48], as the parent trial used self-reported responses (i.e., demographics, state anxiety) and passive features (e.g., time spent) from assessment tasks to classify users at risk of dropout [15]. Given that we did not view training and assessment tasks as fully independent, we allowed all tasks to inform the clusters’ potentially complex engagement patterns. (For an initial cluster analysis based on task completion rate but fewer time-related markers, see Section A.6.1 in S1 File.)
Task completion rate.
Each training session and its associated assessment measures were designed to be completed in one sitting; on average, participants logged in once per session. All participants were asked to complete 5 training sessions and 63 assessment measures, for a total of 68 tasks. Each participant’s task completion rate was calculated as the proportion of tasks that were completed, similar to other papers [17].
Time on training components.
We computed time spent (typically by adding reaction times from the start of an exercise to whatever point in the exercise the participant reached) on the following components of each training session, similar to other papers [43]. For participants with repeated entries for an element within a given component (e.g., a page within an exercise), we removed any duplicated entries and then computed the average reaction time for that element.
Time on imagery practice exercise.
This exercise was given at the start of the first training session to prepare participants to vividly imagine CBM-I scenarios. Participants were asked to use all senses to imagine what it feels like to hold a lemon in their hand.
Time on anxiety imagery prime exercise.
Next, in the first training session, participants were asked to imagine themselves in an anxiety-provoking situation that they were likely to experience to activate their anxious thinking about a personally relevant situation.
Mean time per CBM-I scenario across sessions.
The final training component in the first session (and only training component in later sessions) was 40 CBM-I scenarios. First, time spent on an individual scenario was calculated by summing reaction times across its three parts (title and image, description, comprehension question). For the description and question, we included the time taken to submit a correct response. Then, we averaged the time spent on completed scenarios for a given session. Finally, we used those values to compute the mean time spent on an individual scenario across sessions for a given participant [28].
Time on assessment measures.
We analyzed time spent on measures assessed before the start of Session 1 training so that all participants would have at least one data point. These measures are in Table 1 and included both one-time measures (e.g., demographics, mental health history) and repeated measures (e.g., interpretation biases, anxiety and comorbid symptoms, other cognitive mechanisms, wellness) [43]. For measure details, see Section A.7 in S1 File.
Table 1: Assessment measures used for ‘Time Spent’ engagement markers.
For repeated measures, the average time spent across completed administrations of that measure was computed. To avoid collinearity, we computed Pearson product-moment correlations for time spent values across measures and excluded time spent values with correlations greater than .70. Time spent on BBSIQ was excluded given its high correlation with time spent on RR, Mechanisms, and Wellness measures. For correlations among all engagement features, see Table B.13 in S1 File.
Statistical analysis
Data were obtained from the Public Component (https://osf.io/s8v3h/) of the MindTrails Calm Thinking Study’s OSF project, outputted from the study’s central cleaning scripts (ver. 1.0.0) [50]. All analyses except multiple imputation (see below) were done in R (ver. 4.1.1) [51]. All analyses used an alpha level of .05. Data and analysis code are available at the project’s OSF page (https://osf.io/wynxs/) [52].
Cluster analysis.
Outliers, transformations, and missing data handling for engagement markers.
Before conducting the cluster analysis, we excluded outlying values of time-related engagement markers; log-transformed time-related markers (except for time spent on CBM-I scenarios); and standardized all markers, including completion rate (see Section A.8 in S1 File). For details on missing data handling, see Section A.9 in S1 File.
K-means clustering.
To analyze variability in (potentially complex) engagement patterns, we explored different clustering algorithms (unsupervised learning methods that can suggest subgroups in multidimensional data) to group participants based on their engagement markers. After considering K-means clustering, partitioning around medoids, and agglomerative hierarchical clustering while specifying two to four clusters (for an overview of these algorithms, see Section A.6.2 in S1 File), we chose K-means because of its creation of clusters with similar sample sizes, its superior validation indices, and its popularity (see Section A.10 in S1 File). We used the NbClust package (ver. 3.0) [53] to compute the optimal number of clusters and the kmeans function of the stats package (ver. 4.3.1) [54] to create the clusters. The optimal number of clusters was two.
However, a homogeneity test (Steinley Brusco’s lower bound ratio, LBR, test) [55] run post hoc did not support the presence of more than one cluster. Specifically, the ratio of the within-cluster sum of squares for the 2-means solution to the sum-of-squares total (SSE2/SST = .71) was not less than the lower bound of the ratio that is obtainable for splitting a multivariate normal distribution in half (i.e., .36). (Although a Duda-Hart test [56] supported two clusters, we focus on the more conservative LBR test; see Section A.11 in S1 File.) Thus, our clusters are best interpreted as two descriptive subgroups of a single population rather than as two truly distinct subpopulations [36,57]. Notably, a distinction between descriptive (“constructivist”) and inferential (“realist”) goals of a given cluster analysis is not straightforward (nor is the definition of a “real” cluster), and clusters can be used for pragmatic reasons (e.g., exploratory analysis of data patterns and the effects of the patterns on other variables) even if the clusters do not stem from “real” subpopulations (see this review [57]). We view our clusters as useful partitions of a high-dimensional space of engagement features that allow us to analyze the effects of variability in that complex space on treatment outcomes.
To verify that the clusters differed on engagement markers (following [17]), which were not normally distributed [17], we conducted Wilcoxon rank-sum nonparametric tests using the wilcox.test function in the stats package (ver. 4.3.1) [51]. To further characterize the clusters after analyzing their differences in outcome trajectories (see below), we also ran post hoc tests of their baseline differences on selected demographic variables, anxiety symptom and interpretation bias outcomes, comorbid symptoms (depression, Patient Health Questionnaire-2 [58]; alcohol use, Alcohol Use Disorders Identification Test-Concise) [59], and training confidence. We computed the effect size (r) for each rank-sum test [60] using the wilcox_effsize function of the rstatix package (ver. 0.7.2) [61]. For tests of group differences on categorical demographic variables, we conducted chi-squared tests using the chisq.test function of the stats package and computed effect sizes as Cramér’s V [62]. We interpreted r and Cramér’s V per Cohen’s (1988) guidelines [62].
Multilevel modeling
Missing data handling for outcome measures and imputation model.
For details on missing data handling for outcomes (see Table B.2 in S1 File for descriptives for each outcome over time), including fully Bayesian multiple imputation based on our analysis model [63], see Section A.12 in S1 File.
Analysis model.
To analyze the effect of engagement group on outcome trajectories, we fit a separate multilevel model [64] in each of the imputed datasets for each outcome using the nlme package (ver. 3.1-152) [65] and pooled results across datasets using the mitml package (ver. 0.4-3) [66], which follows Rubin’s rules [67]. Time (assessment point) was represented as two piecewise linear trajectories: time_TR_ during the training phase (baseline to Session 5; coded as 0 for baseline, as 1-5 for Sessions 1 through 5, and as 5 for follow-up) and time_FU_ during the follow-up phase (Session 5 to follow-up; coded as 0 for baseline through Session 5 and as 1 for follow-up). (For specific time codings for each outcome, see Section A.13 in S1 File.) Each model included the fixed effects of engagement group (dummy coded as 0 for less time spent and 1 for more time spent), time_TR, timeFU, Engagement Group TimeTR, Engagement Group TimeFU, and the training confidence covariate (grand mean centered); a random intercept; and random slopes for timeTR_ and time_FU. We first interpreted the Engagement Group Time interactions. If one was significant, we analyzed the simple effects of time with a separate model in each group that included the fixed effects of timeTR, timeFU, and training confidence; a random intercept; and random slopes for timeTR_ and time_FU_. For details, see Section A.16 in S1 File. Parameter estimates are reported as unstandardized b with 95% confidence intervals (CIs) from the confint function.
Effect size.
For all outcomes, we computed the standardized mean difference between engagement groups at Session 5 and at follow-up as growth modeling analysis (GMA) d [68] using the pooled standard deviation at baseline. Given that the groups were not randomized, we computed GMA d unadjusted for baseline mean differences in addition to GMA d adjusted for such differences. GMA d unadjusted for baseline differences reflects group differences in both initial status at baseline and growth from baseline to a given time point [69,70]. GMA d adjusted for baseline differences (which is equivalent to Cohen’s d) [70] reflects group differences in only growth. For within-group effect sizes, we computed each group’s standardized mean difference from baseline to Session 5 and to follow-up using the group’s standard deviation at baseline (A. Feingold, personal communication, March 3-4, 2019).
Results
Demographic characteristics
Most participants (M age = 35 years) identified as female (80.9%), White/European origin (70.4%), Not Hispanic or Latino (81.8%), and from the United States (91.7%). For full demographic characteristics and for baseline levels of anxiety symptoms and interpretation biases for the analyzed sample and each engagement group, see Tables B.1 and B.2 in S1 File, respectively.
Cluster analysis
The two clusters significantly differed on all time-related markers and not on task completion rate (see Table 2). Visual inspection of each group’s distribution for time spent on training (see Fig C.1 in S1 File) and assessment (see Figs C.2 and C.3 in S1 File) tasks and for task completion rate (see Fig C.4 in S1 File) revealed that one group spent less time on training tasks (with small-medium effect sizes: rs = .24-.49; Table 2) and assessment tasks (with large effect sizes: rs = .56-.73) than the other group (whereas completion rates were balanced). Thus, we labeled the groups as “Less Time Spent" (n = 386) and “More Time Spent" (n = 311), respectively.
Table 2: Tests of group differences in engagement markers.
Notably, although the clusters differ on average on every time-related feature, this does not mean that all participants in the Less (More) Time Spent group have lower (higher) values than all participants in the other group on every feature; as shown in Figs C.1-C.3 in S1 File, the groups’ distributions on a given feature overlap. As such, the two clusters are not the result of partitioning on every feature individually (e.g., akin to applying a median split to every feature, which would likely face the issue of participants’ falling below the median on some features but above on others). Rather, the clusters are the result of partitioning on all 17 features at once in a 17-dimensional space. Note that this person-centered approach, which allows us to contrast two complex, descriptive patterns of engagement on treatment outcomes, also differs from a variable-centered approach that aggregates across the time-related features. For example, analyzing whether the mean of all time-related features moderates outcomes would address a different question by reducing the multidimensional space to one number (and lose our focus on potentially complex patterns of engagement).
Post hoc tests revealed that the groups also significantly differed on age, race, education, anxiety symptoms (on DASS-21 AS but not OASIS), and negative interpretation bias (on both RR and BBSIQ; see Table B.4 in S1 File). The group that spent less time on the program (including both training and assessment tasks) was younger, identified with multiple races (vs. one) at a higher rate, and had higher levels of education (see Table B.1 in S1 File) and higher baseline anxiety and negative bias (see Table B.2 in S1 File). These effect sizes were medium for age (r = .30; see Table B.4 in S1 File), small for race (Cramér’s V = .16), very small for education and anxiety (rs = .08-.09), and very small or small for negative bias (rs = .08-.12). The groups did not significantly differ on gender, ethnicity, positive bias, depression symptoms, alcohol use, or training confidence (see Table B.4 in S1 File).
Multilevel modeling
Anxiety symptoms.
Contrary to our hypothesis, the slopes of the engagement groups’ trajectories for anxiety symptoms (assessed by OASIS and DASS-21 AS) did not significantly differ during the training or follow-up phases (Table 3).
Table 3: Piecewise linear multilevel modeling fixed effects.
Positive interpretation bias.
Also unexpectedly, the slopes of the groups’ trajectories for positive interpretation bias (assessed by RR) did not significantly differ during the training or follow-up phases (Table 3).
Negative interpretation bias.
For negative interpretation bias assessed by RR, the slopes of the engagement groups’ trajectories significantly differed during the training phase (b = 0.03, p = .040; Table 3). As shown in Fig 2, both groups had significant reductions from baseline to Session 5. However, unexpectedly, the group that generally spent less time on the program (including both training and assessment tasks; b = −0.10, p<.001; Table 4) decreased more than the group that spent more time (b = −0.07, p<.001). This slope difference translates to a small standardized mean difference between groups at Session 5 (d adjusted for baseline differences = 0.27; Table 3). The groups’ slopes also significantly differed from Session 5 to follow-up (b = −0.15, p = .028; Table 3). The group that spent less time significantly increased in negative bias (b = 0.13, p = .003; Table 4), reflecting some loss in training gains, whereas the group that spent more time had no significant change. This slope difference, combined with the slope difference during the training phase, translates to a negligible mean difference between groups at follow-up (d adj. = −0.01; Table 3).
Model-estimated means over time by engagement group.Note. Estimated means (±1 SE) from the piecewise linear multilevel models at mean level of training confidence were calculated in each imputed dataset and pooled across datasets using the testConstraints function of the mitml package (ver 0.4-3) [66]. Plots were created with the ggplot2 (ver. 3.3.5) [84] and cowplot (ver. 1.1.1) [85] packages. Estimates are shown only for timepoints at which the measure was assessed. OASIS = Overall Anxiety Severity and Impairment Scale; DASS-21 AS = Anxiety Scale of Depression Anxiety Stress Scales; RR = Recognition Ratings; BBSIQ = Brief Body Sensations Interpretation Questionnaire.
Table 4: Piecewise linear multilevel modeling simple time effects for outcomes With significant interaction effects.
For negative bias assessed by the BBSIQ, unexpectedly, the groups’ slopes did not significantly differ during the training phase (Table 3). However, the slopes did significantly differ during the follow-up phase (b = −0.15, p = .033). As shown in Fig 2, after both groups had reductions in negative bias during the training phase, during the follow-up phase the group that spent less time again significantly increased in negative bias (b = 0.16, p = .001, Table 4), reflecting some loss in training gains, whereas the group that spent more time had no significant change. The standardized mean difference between the groups at follow-up, based on their similar slopes during the training phase and different slopes during the follow-up phase, was very small (d adjusted for baseline differences = -0.09; Table 3).
Discussion
This study analyzed differences in anxiety and interpretation bias outcomes for different behavioral engagement patterns in anxious community adults undergoing web-based CBM-I. K-means clustering was used to explore potentially complex engagement patterns based on several program use features involving task completion rate and time spent on training and assessment tasks. Based on this analysis, we created two descriptive engagement groups (likely from a single population vs. two truly distinct subpopulations), with one group generally spending more time on tasks (small-medium effects for training tasks, large effects for assessment tasks). Unexpectedly, participants who generally spent more time on the program (including both training and assessment tasks) did not improve in anxiety and positive interpretation bias significantly more than those who spent less time. Further, although both groups significantly improved in negative interpretation bias during the training phase, participants who spent less time improved in negative bias (when assessed by RR) significantly more than participants who spent more time (and post hoc tests found were significantly older and slightly less educated). However, participants who spent less time had a significant loss in training gains for negative bias (when assessed by RR and BBSIQ) during the follow-up phase, whereas participants who spent more time had no significant change.
Comparable improvement on anxiety symptoms and positive interpretation bias
Some person-centered DMHI studies have found that groups reflecting higher engagement patterns improve more than other groups [17], and some variable-centered studies of individual metrics of time spent have found that greater time spent is related to better outcomes [26–28,32]. Given these results and the common hypothesis that greater use yields better outcomes [23], after the cluster analysis suggested two groups differing in time spent on tasks, we expected the group that generally spent more time (to the extent that this reflects higher engagement) to improve significantly more than the group that spent less time. However, prior results are mixed, and the comparable improvements we found between the two groups in anxiety and positive interpretation bias in this study are consistent with the results of other person-centered studies in which all engagement groups improved during the intervention, with no group improving significantly more than the others [19,71]. The comparable improvements are also consistent with other variable-centered studies’ finding no significant relations between single time spent metrics and outcomes [24,25,30].
Greater improvement on negative interpretation bias for Less Time Spent
We did find differential improvement between groups for one outcome, but in an unexpected direction: During the training phase, participants who generally spent less time on the program (including both training and assessment tasks) improved in negative interpretation bias (when assessed by RR) significantly more than those who spent more time. Over five sessions, this slope difference translates to a small standardized mean difference between the groups at the end of the training phase. If spending less time is an actual marker of less engagement (e.g., if it indicates reduced intention and interest such that users are skimming vs. reading program materials) [18], then it is intuitive that participants who spend less time may lose some training gains during the follow-up phase (as we found for both measures of negative bias, leading to a negligible mean difference between the groups at follow-up for RR and a very small mean difference between the groups at follow-up for BBSIQ). However, it is unclear why participants who spend less time would improve more in negative bias (or in any outcome) during the training phase than those who spend more time. This finding highlights issues in interpreting time spent as a marker of engagement.
Issues in measuring engagement
Although spending more time may reflect greater physical energy investment (i.e., behavioral engagement or participation), time spent is a less clear indicator of behavioral investment than the markers used in some prior person-centered studies of engagement groups [17,19,71], such as task completion rate (which was balanced between the groups in our study) and frequency of using the program (which was constrained in our study given that users could not repeat sessions). Effective engagement is investment directed at the focal tasks in the DMHI that are thought to lead to change in outcomes (e.g., targeted mechanisms and downstream symptoms) [22]. For example, in CBM-I, it is primarily the task of vividly imagining emotionally ambiguous scenarios resolving in less threatening ways that is thought to shift interpretation biases and reduce anxiety [8]. However, we do not know exactly how users were spending time, so distraction could have contributed to longer times. Moreover, engagement involves simultaneous cognitive (e.g., selective attention to the task) and affective (e.g., positive emotional reactions to and commitment to the task) energy investments as well [22]. Thus, although time spent may offer some insight into the process by which users complete tasks, it is critical to understand the actual behaviors and cognitive and affective experiences that underlie users’ time spent, while also examining a variety of other markers of behavioral investment (e.g., task completion rates) and its experiential qualities.
Further characterizing engagement groups
Post hoc tests revealed that the groups significantly differed on aspects other than time spent, providing potential insight into our findings for negative interpretation bias. Notably, in our person-centered approach, any engagement group differences on variables that did not enter into the cluster analysis can help inform interpretation of the groups’ engagement patterns. (This contrasts with a variable-centered approach, in which any correlates of a given engagement variable would represent potential confounds of the effect of the engagement variable on outcomes.) Specifically, our post hoc tests raise the possibility that people in the group that generally spent more time on the program, who were significantly older (medium effect) and less educated (very small effect) than people in the group that spent less time, had slower processing speed [72] and lower reading comprehension [73]. These differences would likely add to their time (on both training and assessment tasks, as was observed) and potentially make it more difficult to quickly learn the pattern presented in the training scenarios (that ambiguous situations tend to end in non-threatening ways), whereas participants who spent less time may have more readily learned this pattern in the short term. However, once users who spent more time had improved, perhaps their gains were maintained at follow-up (unlike users who spent less time, who lost some gains) because of their greater physical investment of time during the training phase. In particular, if users who spent more time thought about the scenarios more deeply or imagined them more vividly, this elaborate processing could have facilitated longer-term recall [39,74,75]. By contrast, if the initial learning for users who spent less time was more superficial, it could have eroded more quickly after the active practice of assigning less negative interpretations during CBM-I had ended.
It is notable that participants who generally spent more time had significantly lower anxiety symptoms and negative interpretation bias at baseline (very small to small effects) than participants who spent less time. Moreover, the two groups had comparable depression symptoms, alcohol use, positive interpretation bias, and confidence in the ability of an online training program to reduce their anxiety. The group that spent more time identified with multiple races (vs. one) at a lower rate than the group that spent less time (small effect for race), but the groups did not significantly differ on gender or ethnicity. Although the potential impacts of the group differences in age and education on engagement patterns and treatment outcomes noted above are speculative, they show the value of further characterizing the clusters for clues as to how the clusters might have originated and why the clusters might have some different outcomes.
Limitations
This study has several limitations. First, given that our homogeneity test did not support rejecting the presence of only a single cluster, our clusters are best interpreted as two descriptive subgroups of one population; we cannot infer the existence of two truly distinct subpopulations [36,57]. Moreover, even though we excluded engagement features that intercorrelated at least .70 from the cluster analysis, the retained features still had non-negligible intercorrelations (rs > .1; see Table B.13 in S1 File) that could have contributed to the algorithm’s suggesting two clusters with higher versus lower means across most features (i.e., our More Time Spent and Less Time Spent groups, respectively) [36]. Second, our findings about the effects of engagement patterns on outcomes are not causal. Third, our sample was demographically homogeneous, and we may have introduced sampling bias by excluding participants who were assigned to receive supplemental telecoaching. Further, although all participants had at least moderate anxiety symptoms at baseline (and on average had marked [76] total OASIS scores [11.55; see Table B.2 in S1 File], surpassing the cutoff of 8 for probable diagnosis [77–79], and extremely severe [80] total DASS-21-AS scores [11.27, or 22.54 on the DASS-42-AS]), our results may not generalize to other populations (e.g., those defined by clinical diagnoses; those who would not seek web-based CBM-I). Fourth, the administration of training and assessment tasks in series (e.g., Session 1 assessment was not administered until Session 1 training was finished; vs. in parallel) constrained our selection of engagement markers (e.g., preventing analysis of completion of training tasks separate from assessment tasks), and not all studies include assessment-related markers. Fifth, for markers involving repeated tasks, we used the mean time spent on those tasks across sessions, ignoring likely variability over the study’s course; engagement is dynamic and can vary not only between persons, but also within persons [22].
Future work
Given that engagement is multifaceted, future research on engagement in DMHIs (for both CBM-I and other interventions) should go beyond the typical focus on physical investments (time spent and other behavioral metrics) [18,20]. Future work could assess affective investments (e.g., perceiving tasks as self-relevant, feeling motivated to do tasks) and cognitive investments (e.g., selectively attending to tasks, using mental resources to meet task demands), in addition to physical investments [22], and study how multifaceted patterns of these investments influence outcomes. Another potentially interesting physical marker is time elapsed between sessions (which we did not include in our cluster analysis because, due to attrition, only 399 participants had the data needed to compute such a variable; see Table B.11 in S1 File). It would also be helpful to directly measure some of the different possible reasons for increased time (e.g., distraction, slower processing speed, doubt about responses). If, as we speculated, users who generally spent less time lost some training gains because they processed the training scenarios less elaborately, future CBM-I programs might include techniques to encourage deeper processing during training (e.g., exercises that apply insights from training to users’ own experiences), consolidate gains at end of training (e.g., relapse-prevention exercises), and facilitate skill application after training (e.g., skill reminders, booster sessions). Future work could also identify individual engagement markers that predict outcomes before performing clustering on only the significant predictors [17,30]. To investigate variability in engagement over time, growth mixture models could identify unobserved subgroups from repeated measures of engagement [81,82]. The present study also highlights the value of further characterizing engagement patterns on variables that did not enter into the cluster analysis (e.g., demographic characteristics, symptom severity), a strategy we suggest for future research. Finally, to analyze separate completion rates for training versus assessment tasks, future work should administer these tasks in parallel, allowing users who stop training to continue assessments.
Conclusion
Understanding differences in treatment outcomes for different real-world patterns of engagement with DMHIs is key to clarifying how and for whom DMHIs are effective and to personalizing DMHIs. This study used an exploratory cluster analysis to create two behavioral engagement groups revealed by program use metrics in a clinical trial of a web-based CBM-I program for anxiety. Unexpectedly, the engagement groups (whose patterns ended up reflecting generally more vs. less time spent across tasks) did not significantly predict differential change in most outcomes. Moreover, participants who generally spent less time on the program (including both training and assessment tasks) improved in one outcome significantly more during the training phase (but had a significant loss in gains for that outcome by 2-month follow-up) than those who spent more time (and post hoc tests found were significantly older and slightly less educated). These unexpected group differences in outcomes highlight the challenge of using time spent metrics as markers of engagement and the need to consider cognitive and affective markers in addition to behavioral markers. Future research on complex engagement patterns in DMHIs holds promise for maximizing their potential to increase access to effective treatment for anxiety.
Supporting information legends
S1 FileSupporting information below is available in the online S1 supplement file.Section A NOTES ON MATERIALS AND METHODS
- – A.1 Excluding Telecoaching Participants
- – A.2 Cognitive Bias Modification for Interpretation (CBM-I) Details
- – A.3 Measure Details and Adaptations
- – A.4 Details for Recognition Ratings
- – A.5 Details for Brief Body Sensations Interpretations Questionnaire
- – A.6 Initial Cluster Analysis and Overview of Clustering Algorithms
- – A.7 Assessment Measures Used for Time-Related Engagement Markers
- – A.8 Outlier Handling and Transformations Details
- – A.9 Missing Data Handling for Engagement Markers
- – A.10 Choosing K-Means Clustering With Two Clusters
- – A.11 Post Hoc Duda-Hart Test
- – A.12 Missing Data Handling for Outcome Measures
- – A.13 Coding of Time
- – A.14 Search for Auxiliary Variables
- – A.15 Imputation Model Specifications
- – A.16 Multilevel Model Specifications
Section B TABLES
- – B.1 Demographic Characteristics by Engagement Group
- – B.2 Raw Descriptive Statistics of Outcomes by Engagement Group Over Time
- – B.3 Out-of-Range Scores Across 100 Imputed Datasets
- – B.4 Post Hoc Tests of Engagement Group Differences at Baseline
- – B.5 Tests of Group Differences in Engagement Markers for Sensitivity Analysis
- – B.6 Piecewise Linear Multilevel Modeling Results for Sensitivity Analysis
- – B.7 Piecewise Linear Multilevel Modeling Simple Time Effects for Significant Interaction Effects for Sensitivity Analysis
- – B.8 Internal Validation Measures for Different Clustering Algorithms for 2-4 Clusters
- – B.9 Stability Validation Measures for Different Clustering Algorithms for 2-4 Clusters
- – B.10 Optimal Number of Clusters Suggested by Clustering Validity Indices
- – B.11 Number of Days Elapsed Between Consecutive Training Sessions
- – B.12 Piecewise Linear Multilevel Modeling Random Effects
- – B.13 Correlations Among Initial Engagement Features
Section C FIGURES
- – C.1 Box and Violin Plots of Time on Training Components by Engagement Group
- – C.2 Box Plots of the Log of Time on Assessment Measures by Engagement Group
- – C.3 Density Distributions of the Log of Time on Assessment Measures by Engagement Group
- – C.4 Density Distribution of Task Completion Rate by Engagement Group
- – C.5 Model-Estimated Means Over Time by Engagement Group for Sensitivity Analysis
- – C.6 Visualization of Clusters Using Principal Components for 2, 3, and 4 Clusters
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1COVID-19 Mental Disorders Collaborators. Global prevalence, burden of depressive, anxiety disorders in 204 countries and territories in 2020 due to the COVID-19 pandemic. Lancet. 2021;398(10312):1700–12. doi: 10.1016/S 0140-6736(21)02143-7 34634250 PMC 8500697 · doi ↗ · pubmed ↗
- 2Andrade LH, Alonso J, Mneimneh Z, Wells JE, Al-Hamzawi A, Borges G, et al. Barriers to mental health treatment: results from the WHO World Mental Health surveys. Psychol Med. 2014;44(6):1303–17. doi: 10.1017/S 0033291713001943 23931656 PMC 4100460 · doi ↗ · pubmed ↗
- 3Newby K, Teah G, Cooke R, Li X, Brown K, Salisbury-Finch B, et al. Do automated digital health behaviour change interventions have a positive effect on self-efficacy? A systematic review and meta-analysis. Health Psychol Rev. 2021;15(1):140–58. doi: 10.1080/17437199.2019.1705873 31847702 · doi ↗ · pubmed ↗
- 4Murray E, Hekler EB, Andersson G, Collins LM, Doherty A, Hollis C, et al. Evaluating digital health interventions: key questions and approaches. Am J Prev Med. 2016;51(5):843–51. doi: 10.1016/j.amepre.2016.06.008 27745684 PMC 5324832 · doi ↗ · pubmed ↗
- 5Michie S, Yardley L, West R, Patrick K, Greaves F. Developing and evaluating digital interventions to promote behavior change in health and health care: recommendations resulting from an international workshop. J Med Internet Res. 2017;19(6):e 232. doi: 10.2196/jmir.7126 28663162 PMC 5509948 · doi ↗ · pubmed ↗
- 6Beard C. Cognitive bias modification for anxiety: current evidence and future directions. Expert Rev Neurother. 2011;11(2):299–311. doi: 10.1586/ern.10.194 21306216 PMC 3092585 · doi ↗ · pubmed ↗
- 7Mathews A. Effects of modifying the interpretation of emotional ambiguity. J Cognit Psychol. 2012;24(1):92–105. doi: 10.1080/20445911.2011.584527 · doi ↗
- 8Mac Leod C, Mathews A. Cognitive bias modification approaches to anxiety. Annu Rev Clin Psychol. 2012;8:189–217. doi: 10.1146/annurev-clinpsy-032511-143052 22035241 · doi ↗ · pubmed ↗
