Developmental Change in Associations Between Mental Health and Academic Ability Across Grades in Adolescence: Evidence from IRT-Based Vertical Scaling
Yuanqiu Ma, Youyou Duan, Yunxiao Qi, Ying Hu, Tour Liu

TL;DR
This study shows that better vocabulary skills in middle school are linked to lower anxiety and depression, with the strongest effect in Grade 8.
Contribution
The study introduces IRT-based vertical scaling to track academic and mental health links across adolescence.
Findings
Higher vocabulary ability is associated with lower depression, anxiety, and stress in middle school.
The strongest mental health benefit of vocabulary ability occurs in Grade 8.
IRT-based vertical scaling effectively tracks academic and psychological development across grades.
Abstract
Adolescence is a critical period when rapid cognitive maturation coincides with heightened emotional vulnerability. This study examined the dynamic association between academic ability and mental health across early adolescence, focusing on vocabulary ability as a core indicator of academic ability. Using large-scale data from Grades 1–12 (N = 13,412), a vertically scaled vocabulary ability scale was constructed based on Item Response Theory (IRT) and the Non-Equivalent Anchor Test (NEAT) design to achieve cross-grade comparability. Fixed-parameter calibration was then applied to an independent cross-sectional sample of middle school students (Grades 7–9, N = 401) in Tianjin, combined with the DASS-21 to assess internalizing symptoms (depression, anxiety, stress). Hierarchical multiple regression analyses revealed that higher vocabulary ability was significantly associated with lower…
Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.
Click any figure to enlarge with its caption.
Figure 1
Figure 2
Figure 3
Figure 4
Figure 5
Figure 6
Figure 7- —Science & Technology Development Fund of Tianjin Education Commission for Higher Education
- —Tianjin Normal University Research Innovation Project for Postgraduate Students
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsPsychometric Methodologies and Testing · Child and Adolescent Psychosocial and Emotional Development · Cognitive Abilities and Testing
1. Introduction
Early adolescence is a developmental window in which rapid cognitive gains co-occur with heightened emotional vulnerability. During this period, neuroplasticity is markedly enhanced, facilitating learning and growth in academic ability; at the same time, emotion-regulatory functions of the prefrontal cortex are not yet fully mature, rendering adolescents more susceptible to mood lability and mental health problems (Casey et al., 2019; Pfeifer & Allen, 2021). Globally, mental health problems among adolescents constitute a major public health concern, with recent estimates indicating that approximately 10–20% experience anxiety or depressive symptoms and that prevalence rates have continued to rise in recent years (Lu et al., 2024; World Health Organization, 2025). In China, systematic reviews and meta-analyses similarly report high prevalence rates of internalizing symptoms among children and adolescents, suggesting a substantial mental health burden comparable to, or in some estimates exceeding, global averages (Z. Chen et al., 2023; Zhou et al., 2024).
Within China’s Confucian-heritage educational system, academic achievement occupies a central social and cultural position and is widely regarded as a primary pathway to social mobility and family honor. This emphasis, reinforced by parental expectations and social comparison, often translates into sustained academic pressure during adolescence, increasing vulnerability to internalizing problems such as depression and anxiety. At the same time, students in Chinese participating regions consistently demonstrate strong academic performance relative to international benchmarks (Organisation for Economic Co-operation and Development, 2023). Comparative research suggests that this coexistence of high achievement and high academic pressure is particularly salient in East Asian education systems, giving rise to the widely discussed “high achievement–high pressure” profile (Steare et al., 2023). In response, China’s Ministry of Education issued the Double Reduction policy in 2021, aiming simultaneously to reduce academic burden and strengthen school-based mental health services. Against this backdrop, examining the coordinated development of academic ability and mental health in early adolescence is of immediate practical importance and of policy relevance for advancing educational equity and improving school mental-health support systems.
Adolescence is widely recognized as a critical developmental period characterized by elevated risk for mental health problems. A large-scale meta-analysis of 192 epidemiological studies identified the peak age of onset for most mental disorders at approximately 14.5 years, with a substantial proportion of first-episode symptoms emerging before age 18 (Solmi et al., 2022). In Western populations, clear sex differences have been documented, with boys showing higher rates of externalizing problems and girls exhibiting greater vulnerability to internalizing symptoms such as anxiety and depression (Salk et al., 2017). In the Chinese context, cultural socialization processes—including collectivist norms and high societal and academic expectations—may further predispose adolescents to internalized psychological difficulties (X. Chen et al., 2013). Recent systematic reviews and large-scale studies indicate that approximately 22–26% of Chinese children and adolescents report depressive symptoms, and about one quarter report anxiety symptoms, placing China at a medium-to-high level internationally (Xu et al., 2024). Longitudinal and repeated cross-sectional evidence further suggests that internalizing symptoms increase steadily during the junior-high years (roughly ages 12–15), particularly among girls and adolescents exposed to high academic pressure (Liu et al., 2024; Y. Sun et al., 2023; Wu et al., 2022). Thus, this body of evidence highlights early adolescence as a critical window for mental-health risk emergence and an important target period for educational and preventive interventions.
Academic ability is a dynamic construct that evolves with cognitive maturation and educational experience. Longitudinal and growth-model research demonstrates that the development of core academic skills is shaped by the interplay of educational environment, socioeconomic resources, and neurocognitive maturation, showing nonlinear trajectories characterized by alternating acceleration and plateau phases (Erbeli et al., 2021; Little et al., 2021). In reading, fluency growth is fastest in the early grades but slows thereafter, and students exhibit heterogeneous developmental profiles rather than a single path (Khanolainen et al., 2024). The reciprocal association between reading and mathematics is stronger in elementary school but tends to weaken or change direction in secondary education (Gnambs & Lockl, 2023). Longitudinal evidence from Chinese children similarly reveals persistent divergence in vocabulary growth (van der Kleij et al., 2023). With the onset of early adolescence (approximately 10–14 years), neural and cognitive systems undergo major reorganisation: prefrontal functions and executive control develop rapidly, and abstract and formal-operational reasoning begin to emerge (Best & Miller, 2010). Such ability differences are socially manifested through grades, standardized tests, and classroom ranking, triggering social-comparison effects that may intensify achievement anxiety and self-efficacy disparities, leading some students to difficulties in academic pressure and emotional regulation (Jiang et al., 2021). Conversely, higher language and vocabulary ability facilitate emotional expression and social communication, thereby reducing psychological distress (Hentges et al., 2021; Mellado, 2025). Academic ability and mental health are believed to exert bidirectional influences. According to the Attentional Control Theory, anxiety disrupts goal-directed attention and consumes working-memory resources, impairing task performance (Eysenck et al., 2007). Excessive stress and anxiety heighten attention to threat cues, reduce working-memory efficiency, and decrease task persistence, whereas chronic anxiety or depression is associated with lower classroom engagement and achievement (Owens et al., 2012; Linnenbrink-Garcia & Pekrun, 2014).
Cross-sectional studies have consistently revealed a negative association between mental-health indicators and academic ability (Steinmayr et al., 2016); however, such evidence only captures static relationships at a single time point. To compare the strength of this association across developmental stages and to identify its trajectory or lagged effects, longitudinal designs are required (Pekrun et al., 2022). Cross-study comparisons indicate that most longitudinal research supports a negative relationship between mental health and academic ability, yet the magnitude and direction of this association vary across samples, statistical models, and control variables. For instance, longitudinal tracking studies in European samples have found that depressive symptoms predict subsequent declines in academic performance; however, when baseline ability and socioeconomic background are statistically controlled, the association often weakens or becomes non-significant (López-López et al., 2021; Wickersham et al., 2023). In Chinese adolescent samples, some studies have reported a persistent negative predictive effect of anxiety on subsequent Chinese-language achievement, whereas others suggest that this effect may be limited to short-term fluctuations or may vary depending on the measurement approach employed. These inconsistencies not only reflect the stage-specific and context-dependent nature of the relationship between mental health and academic ability but also highlight notable limitations in current psychometric practices (W. Chen, 2025; Ye et al., 2019). At present, many longitudinal studies of academic development still rely on the Classical Test Theory (CTT) framework, in which measurement typically depends on within-grade standard scores, raw scores, or teacher ratings. Such scores primarily reflect an individual’s relative position within a specific sample or test form, rather than representing an interval-scaled latent ability; consequently, cross-grade comparisons may inflate or underestimate true growth (Protopapas et al., 2014). Moreover, item difficulty, scoring standards, and content coverage often differ substantially across grades or test versions. Without employing scaling or linking techniques to place all forms on a common scale, it is impossible to ensure measurement invariance and conceptual equivalence of the latent construct (Kolen & Brennan, 2014). Therefore, longitudinal or cross-sectional studies based solely on CTT scores may conflate true developmental change with measurement error, hindering accurate identification of stage-specific features in the link between mental health and academic ability (Gorter et al., 2015).
Overall, developmental evidence concerning the relationship between academic ability and mental health during early adolescence (approximately 12–15 years) in the Chinese context remains limited. Existing studies have primarily focused on cross-sectional samples or a single educational stage, offering little insight into developmental continuity. To address this gap, the present study adopts a developmental perspective that integrates educational measurement and mental health research to examine whether the association between academic ability and internalizing symptoms follows a dynamic, stage-specific pattern during early adolescence. Vocabulary comprehension was selected as a core indicator of academic ability, given its foundational role in language understanding and its relevance across academic domains (Ricketts et al., 2020). Methodologically, Item Response Theory (IRT) and vertical scaling were used to construct a unified developmental vocabulary ability scale spanning Grades 1–12. This scale was then applied to an independent junior-high sample (Grades 7–9) using a fixed-item-parameter calibration procedure (König et al., 2021), enabling developmentally comparable ability estimates. This measurement framework allowed us to examine cross-grade variation in the association between academic ability and internalizing symptoms without confounding developmental differences with measurement artifacts.
Based on developmental theory and prior empirical evidence, the present study tested two primary hypotheses. First, vocabulary ability was expected to be negatively associated with internalizing symptoms during early adolescence, such that adolescents with higher vocabulary ability would report lower levels of depression, anxiety, and stress (H1). Second, the strength of this association was hypothesized to vary across grade levels, reflecting developmental differences in academic demands and emotional vulnerability during early adolescence (H2). In addition, by comparing IRT-based vertically scaled vocabulary ability estimates with within-grade standardized raw scores, the study examined whether conclusions about academic–mental health associations depend on the measurement framework used.
2. Materials and Methods
2.1. Participants
This study comprised two rounds of data collection. Dataset 1, used for the standardization and scaling of academic achievement measures, was drawn from eight public schools located in Shenzhen, a major city in southern China. The dataset covered primary, junior, and senior high school levels. A multi-site convenience sampling design was adopted, with intact classrooms serving as the testing units. A total of 13,536 native Mandarin-speaking students participated in the assessment. Data screening excluded any cases meeting one or more of the following criteria: (1) missing responses exceeding 10%; and (2) aberrant response patterns, such as invariant answers across items or near-random response behavior. After data cleaning, the final valid sample comprised 13,412 students, aged 6 to 18 years, encompassing Grades 1 through 12 within China’s compulsory-education and general high-school system. Table 1 presents the sample size and descriptive statistics of raw scores by grade level.
Dataset 2 focused on junior-high students and was used to examine the association between academic ability and mental health during early adolescence. Participants were drawn from four public junior high schools in Tianjin. After data cleaning, cases with missing responses or aberrant answering patterns were removed, resulting in a final valid sample of 401 students. Among them, 48.6% were female, and 43.0% were only children. The final sample was distributed across Grade 7 (n = 131; M = 12.36 years, SD = 0.51), Grade 8 (n = 132; M = 13.40 years, SD = 0.52), and Grade 9 (n = 138; M = 14.30 years, SD = 0.50).
2.2. Measures
2.2.1. Depression Anxiety Stress Scales-21 (DASS-21)
The Chinese version of the DASS-21 was administered to assess adolescents’ internalizing symptoms (Gong et al., 2010; Lovibond & Lovibond, 1995). The DASS-21 is a self-report questionnaire consisting of 21 items that measure negative emotional states experienced during the previous week. It comprises three subscales—Depression, Anxiety, and Stress—each containing seven items. All items are rated on a 4-point Likert scale ranging from 0 = “Did not apply to me at all” to 3 = “Applied to me very much or most of the time.” Subscale scores are obtained by summing responses across the seven items within each domain, with higher scores indicating greater symptom severity. In the present sample, all three subscales demonstrated good internal consistency reliability, with Cronbach’s α = 0.83 for Depression, α = 0.85 for Anxiety, and α = 0.86 for Stress.
2.2.2. Standardized Vocabulary Comprehension Tests
Students’ vocabulary ability was assessed using the Chinese Vocabulary Comprehension Test for Primary and Secondary Students. The theoretical framework and original item bank were developed based on the pioneering IRT scaling research of Cao (1999). The item bank consists of 649 five-option multiple-choice items, derived from the high-frequency and core vocabulary of Chinese language textbooks for Grades 1–12, encompassing major semantic categories (nouns, verbs, adjectives, and function words) specified in the national curriculum standards. Each item presents the target word within a short contextual sentence, and students choose the most appropriate meaning from five alternatives. All items were dichotomously scored (1 = correct, 0 = incorrect). To enable vertical linking across grades, the test employed a Non-Equivalent Groups Anchor Test (NEAT) design, in which each grade-level form contained common anchor items as well as grade-specific items. The proportion of anchor items ranged from 18% to 52%. Results from multiple pilot and large-scale administrations confirmed the psychometric robustness of the test. Internal-consistency coefficients were high across grade-level forms (Cronbach’s α = 0.86–0.88), and the vertical-scaling outcomes were stable and reliable. Details of the anchor-item distribution are reported in Table 2.
2.3. Procedure
The present study collected two complementary datasets to address different but interrelated research objectives. Dataset 1 was collected to support IRT calibration and vertical scale construction across Grades 1–12, whereas Dataset 2 was used to examine the association between vocabulary ability and mental health during early adolescence (Grades 7–9). To obtain stable item-parameter estimates, IRT calibration and vertical scaling typically require large samples at each grade level (Embretson & Reise, 2000). Accordingly, Dataset 1 included 13,536 students, with approximately one thousand participants per grade, providing a sufficient empirical basis for cross-grade parameter estimation and linking. For the regression analyses, Dataset 2 comprised 401 junior-high students, which is consistent with commonly accepted guidelines for detecting small-to-moderate effects, including interaction terms, in multiple regression models (Cohen, 1992).
Data for both datasets were collected from primary and secondary school students in two economically developed regions in China, in close collaboration with participating schools. Assessments were administered at the classroom level during regular school hours. Prior to testing, trained research assistants provided standardized instructions to all participants, emphasizing that participation was entirely voluntary and that responses would be kept strictly confidential. For Dataset 1, vocabulary assessments were administered by trained teachers or research staff at participating schools following a standardized testing protocol. The assessment typically required approximately 20–30 min to complete. For Dataset 2, both the vocabulary test and the mental health questionnaire (DASS-21) were administered by trained research assistants in cooperation with school staff. All participants completed the assessments individually within a single session lasting approximately 30–45 min. Data collection for both datasets was completed in late September 2023.
All study procedures were conducted in accordance with established ethical standards. Written informed consent was obtained from all participants and their parents or legal guardians prior to data collection. Participation was voluntary, data were anonymized and securely stored, and participants were informed of their right to withdraw at any time. If a participant’s questionnaire responses indicated potential emotional distress, the research team informed the school in accordance with pre-established collaboration procedures, allowing school staff to provide appropriate follow-up support.
2.4. Vocabulary Ability Scale Construction
To enable meaningful comparisons of vocabulary ability across grades, the scale construction process followed three core steps. First, IRT was applied to estimate latent vocabulary ability while accounting for item difficulty and discrimination. Second, because different grade-specific test forms were used, a vertical linking procedure was implemented to place all item parameters onto a common metric, ensuring cross-grade comparability. Third, based on the linked item parameters, a fixed-parameter calibration approach was used to obtain comparable ability estimates for the junior high school sample analyzed in subsequent models. Together, these procedures ensured that observed grade differences in vocabulary ability reflected developmental variation rather than measurement artifacts.
2.4.1. IRT Modeling
The present study employed IRT to estimate students’ latent vocabulary ability and to construct a common scale suitable for cross-grade comparisons. Unlike raw test scores, which are inherently sample and test-form-dependent, IRT-based ability estimates provide a model-based metric that supports meaningful comparisons across different test forms and grade levels (Embretson & Reise, 2000; Kolen & Brennan, 2014). Several IRT models were compared in terms of model fit and parameter stability (see Appendix A.1), and the two-parameter logistic (2PL) model was selected as the unified framework for vertical scaling due to its balance between interpretability and stability.
Under the 2PL model, the probability that student j with latent ability correctly answers item i was defined as:
where is the discrimination parameter (sensitivity to differences in ability) and is the difficulty parameter, the ability level at which the probability of a correct response is 0.50 (Baker & Kim, 2004).
2.4.2. Vertical Linking
Because vocabulary tests administered at different grade levels necessarily differ in item composition and difficulty, vertical linking is required to ensure that latent ability estimates are expressed on a common developmental scale rather than grade-specific metrics (Lord, 1980; Kolen & Brennan, 2014). The study employed a NEAT design to link scales across grade levels. Two principal strategies were considered for placing item parameters from different grades on a common scale: concurrent calibration and separate calibration (Kolen & Brennan, 2014). The concurrent-calibration approach assumes that a single IRT model provides an adequate fit across grades; however, given the wide developmental range of the present dataset (Grades 1–12), this assumption was considered restrictive. (Embretson & Reise, 2000). To enhance flexibility and accommodate grade-specific characteristics, we adopted a separate-calibration strategy: item parameters were first estimated independently within each grade, followed by post hoc linking across grade-level scales.
Under separate calibration, two independently estimated scales that share common anchor items can be related through an approximately linear transformation (Lord, 1980; Kolen & Brennan, 2014). The transformation from scale p to the reference scale can be expressed as:
where A_p_ > 0 (slope) and B_p_ (intercept) are linking constants. For the 2PL model, item parameters transform as:
Constants were estimated by minimizing discrepancies between transformed anchor-item functions across the two scales (Hambleton et al., 1991).
The study implemented chained linking to vertically connect scales across Grades 1–12, using Grade 7 (G7) as the reference scale. All subsequent transformations and linkages were ultimately anchored to the G7 metric. Grade 7 was selected as the reference level because it lies near the midpoint of the K–12 developmental continuum, thereby minimizing the propagation and accumulation of linking errors across multiple transformations (Battauz, 2015). As shown in Figure 1, these bidirectional linking paths extend both forward and backward across adjacent grades, forming an integrated scale network that unifies all test forms into a single continuous measurement continuum.
The implementation process consisted of three sequential phases: (1) Each grade-level dataset was independently calibrated under the 2PL model to obtain 12 initial sets of item parameters. (2) Using an anchor set (excluding items exhibiting significant differential item functioning), the chained-linking procedure was applied to estimate the final transformation constants A_p_ and B_p_ from each grade to the G7 reference scale. (3) All grade-level item parameters were transformed to the G7 scale according to Equations (2) and (3), thereby establishing a unified cross-grade measurement scale.
2.5. Statistical Analysis
All statistical analyses were conducted in R version 4.2.3 (R Core Team, 2023). For each grade-level dataset, the 2PL model was fitted using the Expectation–Maximization (EM) algorithm implemented in the mirt package (Chalmers, 2012). Prior to parameter estimation, essential unidimensionality was evaluated by fitting a unidimensional IRT model and inspecting the M_2_-based global fit indices. Adequate fit (RMSEA < 0.08, CFI > 0.90, TLI > 0.90; Hu & Bentler, 1999) was taken as evidence supporting essential unidimensionality. To stabilize the linking and ensure anchor-item invariance, all designated anchor items were screened through Differential Item Functioning (DIF) analyses using likelihood-ratio tests for nested IRT models (Thissen et al., 1993). False Discovery Rate (FDR) correction (Benjamini & Hochberg, 1995) was applied to control for multiplicity (α = 0.05), and graphical inspection of item difficulty (b) parameters was used to further exclude biased anchors, ensuring cross-grade stability.
To determine the optimal linking function, four commonly used linking methods available in the equateIRT package (Battauz, 2015) were compared: Haebara, Stocking–Lord, Mean–Mean, and Mean–Sigma. The comparison was based on two criteria: (a) smoothness and monotonic separation of Test Characteristic Curves (TCCs) across grades, and (b) standard errors of the linking constants, where smaller errors indicate greater stability (Kolen & Brennan, 2014). After identifying the optimal method, fixed-parameter calibration was conducted to rescore all examinees, and ability estimates (θ) were obtained using the Expected A Posteriori (EAP) method in mirt, allowing visualization of grade-level latent ability trajectories. To examine how the development of vocabulary ability moderates its association with adolescents’ internalizing symptoms, the fixed-parameter calibration approach was applied to the cross-sectional junior high school dataset (Dataset 2). Using the linked item parameters from Dataset 1 as fixed references, response data from Dataset 2 were analyzed in the mirtCAT package (Chalmers, 2025) to obtain comparable latent ability estimates for Grades 7–9 directly on the established Grade 7 reference scale. These IRT-based vocabulary ability estimates served as the independent variable in subsequent analyses.
The three subscales of psychological health—Stress, Depression, and Anxiety—served as dependent variables. Relationships among variables were examined using hierarchical multiple regression analysis. In the first step, gender and only-child status were entered as control variables. as prior meta-analyses had demonstrated consistent gender differences in internalizing symptoms (Salk et al., 2017), and only-child status has been shown to have cultural specificity in the Chinese sociocultural context (Lin et al., 2021). In the second step, grade level (with Grade 7 as the reference group) and IRT-based vocabulary ability were entered to test their main effects. In the third step, an interaction term (Vocabulary Ability × Grade) was added to test whether the association between vocabulary ability and internalizing symptoms varied across developmental stages. All continuous predictors were mean-centered to reduce multicollinearity and facilitate the interpretation of main effects. Model explanatory power was assessed using the change in R^2^ (ΔR^2^) (Cohen et al., 2003).
3. Results
3.1. Construction and Verification of the Vertical Scale
To ensure the stability of the linking, DIF was examined using IRT Likelihood Ratio Tests (LRT) combined with visual inspection of scatterplots of item difficulty parameters (b) between adjacent grades (see Appendix A.2, Figure A1). After screening, six items with irregular response patterns were removed, and thirteen items exhibiting severe DIF were downgraded to non-anchor status. The final anchor proportions for each grade are reported in Table 3.
Independent 2PL model estimation for each grade (Table 3) showed satisfactory model fit across test forms: most exhibited CFI/TLI values above 0.90 and RMSEA values below 0.05, supporting essential unidimensionality. A few higher-grade forms (e.g., G10: CFI = 0.86, TLI = 0.85, RMSEA = 0.05) demonstrated slightly weaker fit, yet the overall fit remained acceptable for vertical scale construction.
To establish a stable vertical scaling system across Grades 1–12, four IRT-based linking methods (Haebara, Stocking–Lord, Mean–Mean, and Mean–Sigma) were compared (see Appendix B, Table A2). Using the equated parameters, Test Characteristic Curves (TCCs) were plotted for each grade (Figure 2).
The Test Characteristic Curves (TCCs) for each grade exhibited the typical S-shaped growth pattern, where expected scores increased with higher levels of latent ability (θ). Moreover, as grade level increased, the curves shifted progressively rightward, indicating that students in higher grades required greater latent ability (θ) to achieve comparable expected scores. This systematic rightward shift reflects increasing test difficulty across grades, providing evidence for the vertical validity of the scale. Among the four linking methods, both the Haebara and Stocking–Lord approaches yielded the smoothest and most orderly progression of TCCs while maintaining stable linking constants with smaller standard errors. Considering both model stability and transformation precision, the Haebara method was selected as the final procedure for vertical scale construction.
Consequently, the descriptive statistics of the scaled ability and item parameters are summarized in Appendix B (Table A3). In the final longitudinal scale, item discrimination parameters were generally high (mean = 1.34, SD = 0.58), and grade-level mean item difficulty parameters (b) ranged from –3.21 to 1.75, encompassing both lower and higher regions of the latent ability continuum. Figure 3 presents a comparison between the Test Information Functions (TIFs) and the estimated ability distributions across grades. The TIF reflects the measurement precision at each level of latent ability, where higher peaks indicate greater precision (Lord, 1980). Alignment between the TIF peaks and the centers of the ability distributions indicates that measurement precision is maximized at the ability levels most densely represented in the sample, resulting in lower measurement error for the target population. Across all twelve grades, the peaks of the TIFs closely aligned with the centers of the respective ability distributions, indicating that each grade-level test achieved optimal precision for its target population (Embretson & Reise, 2000). These findings suggest that the constructed longitudinal scale demonstrates high measurement reliability and structural stability across grades.
3.2. Developmental Trajectory and Group Variability
As shown in Figure 4, estimates of students’ latent ability (θ) from Dataset 1 (Shenzhen sample) across Grades 1–12 illustrate the developmental trend and inter-group variability in vocabulary ability. With increasing grade level, vocabulary ability exhibited a nonlinear and stage-like developmental pattern. The Grades 6–9 period showed the steepest growth slope, indicating this stage as a critical period of accelerated ability development. The standard deviation of ability scores expanded progressively from the primary grades, suggesting widening individual differences; these differences peaked during Grades 7–9 and then gradually narrowed in the higher grades.
Boxplots were used to visualize the distribution of students standardized latent ability (θ) in Grades 7–9 from Shenzhen (Dataset 1) and Tianjin (Dataset 2), as shown in Figure 5. The comparison revealed a parallel growth trend in ability levels across grades for both regions, indicating a generally consistent developmental pattern. Meanwhile, students from Shenzhen consistently demonstrated higher latent ability levels than those from Tianjin, suggesting systematic regional differences in vocabulary ability performance.
3.3. Descriptive Statistics and Bivariate Correlations
The core analyses focused on the junior secondary school sample from Tianjin, with descriptive and correlational statistics presented in Table 4. Significant positive intercorrelations were observed among the three internalizing symptom dimensions (depression, anxiety, and stress), suggesting strong covariation among these negative emotional states. Vocabulary comprehension reflects a developmental progression across grades: students’ vocabulary ability increased significantly with grade level, with a strong positive correlation observed between grade and ability (r = 0.57, p < 0.001). Moreover, Vocabulary ability showed statistically significant negative correlations with both depression (r = −0.14, p = 0.006) and anxiety (r = −0.16, p = 0.002), while its negative association with stress did not reach significance (r = −0.10, p = 0.051).These results preliminarily suggest that higher vocabulary ability is associated with lower levels of depression and anxiety symptoms.
3.4. Cross-Grade Association Between Vocabulary Development and Internalizing Symptoms
Table 5 presents the results of hierarchical multiple regression analyses predicting three internalizing dimensions—depression, anxiety, and stress. Unstandardized regression coefficients (B) are reported, with vocabulary ability operationalized as IRT-derived latent scores estimated on a common vertical scale. The analyses examined the predictive effects of grade, vocabulary ability, and their interaction, while controlling for gender and only-child status.
In Model 1, which included only control variables (gender and only-child status), the explained variance was low (R^2^ ≈ 0.01), and the overall F-test was non-significant. In Model 2, grade (with seventh grade as the reference group) and vocabulary ability (θ) were added, resulting in a significant improvement in explanatory power (ΔR^2^ ≈ 0.04–0.05). Compared with seventh graders, eighth graders reported higher levels of depressive, anxiety, and stress symptoms, with statistical significance varying across outcomes (Table 5). Ninth graders also showed higher levels of depressive and stress symptoms relative to seventh graders, whereas differences in anxiety were not statistically significant. Across all three internalizing dimensions, vocabulary ability demonstrated a significant negative predictive effect, indicating that higher vocabulary ability was associated with lower levels of depression (B = −1.28), anxiety (B = −1.04), and stress (B = −1.08).
When the interaction term between vocabulary ability and grade was added in Model 3, the model’s explanatory power showed a modest improvement (ΔR^2^ ≈ 0.01), indicating a small but statistically significant contribution of the interaction terms. The interaction between vocabulary ability and grade was significant for eighth grade across all three outcomes. Specifically, the vocabulary ability × eighth grade interaction was negatively associated with depression (B = −2.58), anxiety (B = −1.56), and stress (B = −1.77). In contrast, the vocabulary ability × ninth grade interaction did not reach statistical significance for any of the outcomes. As illustrated in Figure 6, the negative association between vocabulary ability and internalizing symptoms was more pronounced in Grade 8 than in Grade 7, reflecting steeper negative slopes as indicated by the larger negative interaction coefficients in Model 3, with a consistent directional pattern across the three internalizing dimensions.
To evaluate the robustness of the measurement approach, a comparative analysis using within-grade standardized raw scores was conducted (Appendix C, Table A4). Consistent with the main analyses, the IRT-based latent ability scores demonstrated comparatively stronger predictive validity and greater developmental sensitivity than conventional raw-score indicators.
4. Discussion
4.1. Overview of Main Findings
This study examined the association between academic ability and mental health during early adolescence from a developmental perspective. To enable cross-grade comparability, we used IRT to construct a vertically linked vocabulary scale spanning Grades 1–12, thereby placing ability estimates on a common developmental metric. Using this common metric, we examined associations between vocabulary ability and three internalizing symptoms and tested whether these associations varied across Grades 7–9. The results indicated that higher vocabulary ability was associated with lower levels of depression, anxiety, and stress. After controlling for gender and only-child status, these associations remained statistically significant, and the Grade 8 interaction terms suggested a relatively steeper negative association compared with Grade 7, albeit with modest incremental variance explained. Overall, the present study provides empirical evidence on the developmental interplay between academic and psychological functioning in adolescence and illustrates the utility of IRT-based vertical scaling for cross-grade measurement in the Chinese context. These findings contribute empirical evidence to research on academic–mental health co-development in adolescence and demonstrate how IRT-based vertical scaling enables the identification of developmental patterns in academic–psychological associations across grades in the Chinese context.
4.2. Developmental Interpretation of Grade Differences
In a sample of junior secondary students, higher vocabulary ability was significantly associated with lower levels of depression, anxiety, and stress. This finding is consistent with longitudinal and review studies showing that children with weaker language skills are more likely to exhibit internalizing problems during late childhood and early adolescence (Bornstein et al., 2013; Hentges et al., 2021). Language ability may reduce internalizing symptoms through two primary pathways: by facilitating emotion regulation (e.g., cognitive reappraisal or linguistic distancing; Nook et al., 2020, 2025) and by enhancing social competence (e.g., improved peer interaction and emotional support; Wieczorek et al., 2024). In this context, stronger language ability may reflect richer emotional vocabulary and inner speech resources, which could support emotion labeling and cognitive reappraisal as well as more effective social communication. Consequently, students with higher vocabulary ability may report lower levels of internalizing symptoms. Furthermore, the negative association appeared relatively more pronounced in Grade 8 than in Grade 7. Within the Chinese middle-school curriculum structure, Grade 8 marks a significant increase in course difficulty and academic demands, which may heighten the relevance of language-related resources for emotional regulation and coping (J. Sun, 2024). In this context, the association between vocabulary ability and internalizing symptoms may become more salient during Grade 8, even if the overall effect size remains modest.
By contrast, although Grade 9 is generally associated with increasing pressure related to the high-stakes entrance examination, data collection in the present study took place at the beginning of the fall semester, prior to the peak period of exam-related stress. Prior research suggests that exam-related stress and emotional distress tend to intensify as high-stakes examinations approach rather than remaining constant across the school year, with peaks in mental health symptoms often observed during examination periods (George, 2024). As a result, the psychological burden typically associated with imminent entrance examinations may not yet have fully manifested at the time of assessment in the present study. Moreover, in the Chinese context, higher academic burden and examination-related pressure have been consistently associated with depressive and anxiety symptoms among adolescents (Wang et al., 2025). In addition, Grade 9 students may enter a more structured phase of exam preparation (e.g., standardized instruction and collective training), which could reduce between-student variability in study routines and perceived stress early in the semester. Such contextual arrangements may reduce between-student variability in emotional and stress responses during the early semester, thereby attenuating the statistical detectability of interaction effects at this stage.
4.3. Methodological Implications
Methodologically, this study goes beyond prior work that has relied primarily on CTT-based raw scores or within-grade standardized scores by adopting an IRT-based vertical scaling framework to support cross-grade comparability of vocabulary ability. By placing academic performance on a common latent scale rather than treating it as grade-specific or sample-dependent, this approach enabled developmentally interpretable comparisons across grades. This measurement strategy proved critical for identifying stage-specific patterns in the association between academic ability and internalizing symptoms. Using a unified ability scale, we were able to examine how the strength of academic–psychological associations vary across developmental stages, revealing a more pronounced association in early secondary school (especially Grade 8). Such patterns may be difficult to detect using conventional within-grade standardization, which removes between-grade variance and may obscure developmental differences.
The use of IRT calibration and linking aligns with established practices in large-scale assessments such as NAEP, PISA, and TIMSS, where common metrics are used to ensure comparability across forms, grades, and populations (Yamamoto & Mazzeo, 1992). Consistent with prior large-scale studies of academic skill development, the resulting vocabulary trajectory showed a nonlinear, stage-specific pattern, characterized by steady growth in primary school, accelerated growth with widening individual differences in early secondary school, and a leveling-off in upper secondary school (Peng et al., 2019). Importantly, by integrating a developmentally comparable academic ability scale with mental health outcomes, the present study illustrates how vertically scaled academic measures can advance research on the dynamic interplay between academic and psychological development during adolescence.
4.4. Limitations and Future Directions
Naturally, this study has several limitations. First, participants were recruited via convenience sampling from public schools in two economically developed Chinese cities, which may limit generalizability to other regions and school contexts. Neither dataset collected individual-level socioeconomic indicators or ethnicity; contextual information was limited to the city/school level. Because Dataset 1 primarily served IRT calibration and vertical scaling, we prioritized large grade-level samples and response-quality screening over detailed background variables, and Dataset 2 included only basic demographics for the regression models. Preliminary checks suggested possible DIF in a small number of items, and mean ability estimates differed across cities, supporting the need for broader multi-site calibration and validation. Future studies should sample more diverse regions and school types and collect richer contextual data to test invariance/DIF and improve the robustness and generalizability of the scale. Second, because the present study adopted a cross-sectional design, causal direction and true developmental trajectories could not be identified, and inferences about underlying mechanisms may be biased. In addition, academic ability was represented solely by vocabulary ability, excluding domains such as mathematics and reading comprehension, which limits a comprehensive understanding of academic–mental health covariation. Future research should employ multi-wave longitudinal designs to examine the temporal ordering and causal effects between vocabulary ability and internalizing symptoms. It is recommended that future work integrate mathematics and reading-comprehension measures within the unified scaling framework and include process-level tracking of mental health (e.g., emotion-regulation tasks or experience sampling) to enhance the explanatory power of mechanism testing. Furthermore, future studies could extend hybrid IRT approaches, such as multi-group IRT, Bayesian hierarchical modeling, and NEAT-linking comparisons across regions to systematically evaluate DIF and measurement fairness, thereby improving cross-regional comparability.
5. Conclusions
Using an IRT-based vertically scaled vocabulary metric, this study examined cross-grade patterns in the association between vocabulary ability and internalizing symptoms during early adolescence. Higher vocabulary ability was associated with lower levels of depression, anxiety, and stress, and this association showed modest but grade-specific variation, appearing relatively stronger in Grade 8 than in Grade 7. Beyond these substantive findings, the study highlights the value of vertically scaled measures: placing vocabulary ability on a common metric enabled the detection of stage-specific association patterns that may be attenuated or obscured when using within-grade standardized scores. Together, these results contribute evidence on academic–mental health interplay in early adolescence and underscore the importance of cross-grade comparable measurement for developmental research.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1Baker F. B. Kim S.-H. Item response theory: Parameter estimation techniques 2nd ed.CRC Press 200410.1201/9781482276725 · doi ↗
- 2Battauz M. Equate IRT: An R package for IRT test equating Journal of Statistical Software 201568712210.18637/jss.v 068.i 07 · doi ↗
- 3Benjamini Y. Hochberg Y. Controlling the false discovery rate: A practical and powerful approach to multiple testing Journal of the Royal Statistical Society: Series B 1995571289300
- 4Best J. R. Miller P. H. A developmental perspective on executive function Child Development 20108161641166010.1111/j.1467-8624.2010.01499.x 21077853 PMC 3058827 · doi ↗ · pubmed ↗
- 5Bornstein M. H. Hahn C.-S. Suwalsky J. T. D. Language and internalizing and externalizing behavioral adjustment: Developmental pathways from childhood to adolescence Development and Psychopathology 201325385787810.1017/s 095457941300021723880396 PMC 4151616 · doi ↗ · pubmed ↗
- 6Cao Y. The development of a vocabulary ability scale for junior high school students Acta Psychologica Sinica 1999312215221
- 7Casey B. J. Heller A. S. Gee D. G. Cohen A. O. Development of the emotional brain Neuroscience Letters 2019693293410.1016/j.neulet.2017.11.05529197573 PMC 5984129 · doi ↗ · pubmed ↗
- 8Chalmers R. P. mirt: A multidimensional item response theory package for the R environment Journal of Statistical Software 201248612910.18637/jss.v 048.i 06 · doi ↗
