Consistency Analysis of Assessment Boards in University Entrance Examinations in Spain
Alejandro Veas, José-Antonio López-Pina

TL;DR
This study examines inconsistencies in university entrance exam assessments in Spain, revealing significant differences in grading severity across subjects and universities.
Contribution
The study applies the Many-Facet Rasch Model to analyze rating inconsistencies in EBAU exams, highlighting the need for standardized evaluation criteria.
Findings
Large inconsistencies were found in the rating process across subjects and universities.
Differences in severity levels were observed both within and between universities.
The study suggests the need for better criteria to ensure comparable examination standards.
Abstract
University entrance examinations (EBAU in Spanish) are a nationwide program for assessing student achievement levels in Spain and determining access to public undergraduate degrees. Considering the need to measure the progress of rater performance, this study analyzes rating data from the June 2018 exam sitting in the Valencian Community, Spain. A total of 54 assessment boards and 3000 students from five public universities were observed. The Many-Facet Rasch Model (MFRM) was used as an extension of the one-parameter Rasch measurement model. All facets involved in analyses (examination board severity, subject difficulty, and group performance) were located on a common underlying linear scale. The results showed large inconsistencies in the rating process, with differences in the severity levels of many subjects both within and between universities. This study may serve as a starting…
Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.
Click any figure to enlarge with its caption.
Figure 1- —MCIN/AEI/10.13039/501100011033
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsPsychometric Methodologies and Testing · Advanced Causal Inference Techniques · Reliability and Agreement in Measurement
1. Introduction
Assessment processes are an essential tool for training and measuring the impact of educational systems. While the purpose of the assessment may vary, many countries have focused on determining how the qualitative and quantitative analyses of standards impact the quality and accuracy of relevant national examinations (Baird et al., 2000; Blyth, 2014; Coe, 2007; He et al., 2018; Meadows et al., 2025; Stringer, 2012).
In Spain, one of the main assessment instruments is the “Baccalaureate Examinations for University Entrance” (EBAU). These examinations are designed to assess the academic readiness of students who wish to enter higher education. To that end, the EBAU allows students to demonstrate their knowledge of and skills in various subjects after completing two Baccalaureate courses.
There have been several attempts in Spain to analyze the difficulty or inter-subject comparability of examinations (Cuxart-Jardí, 1996, 2000; Cuxart-Jardí & Longford, 1998). However, the political and administrative regulation of the country produces difficulties in statistically ensuring equal examinations. Each Spanish Region is autonomous and can design specific tests, following minimal regulation standards according to Spanish law (Order PCI/12/2019). In addition, the analysis of difficulties among examinations has been linked to possible rater effects (European Commission/EACEA/Eurydice, 2009). In this regard, Eckes (2009) proposed two main sources of error that affect objectivity in assessment: distal and proximal factors. Distal factors refer to aspects such as student characteristics and assessment context. Some variations are related to gender and age. Proximal factors, however, refer to the construct being measured, such as the structure of score categories and grading criteria.
In high-stakes assessments, multiple studies reported relevant results regarding rater effects and the impact of rubrics and standard criteria. Jonsson and Svingby (2007) stated that analytical rather than holistic rubrics are preferable in this type of assessment. In a recent study, Holmes et al. (2018) detected a strong correlation between expected difficulty using the comparative judgment approach and actual difficulty derived from a Rasch analysis of the GCSE mathematics examination. Thus, considering the relevant research on this topic, comparability measures would be more reliable indicators if judgment criteria were also explored. This study aims to explore the consistency rating of boards in Spanish university entrance examinations.
1.1. Antecedents
Prior studies have mainly focused on developing predictive explanatory models of achievement in Spain, including in the EBAU context. During the 1980s and 1990s, inequality access was explored through analyzing selection criteria (Cuxart-Jardí et al., 1997; Goberna et al., 1987). These studies confirmed the importance of considering the heterogeneity of measures and their impact on both university enrolment and objective selection criteria. Differences could be found in personal variables, chosen scientific degree, and the relationships between the predictive power of subjects and undergraduate degrees analyzed (Cuxart-Jardí, 2000; Rodríguez-Menéndez et al., 2014; Ruiz et al., 2011). In addition, quasi-experimental studies were conducted to analyze the differences between raters and boards in both total and partial scores of examinations. For instance, Escudero-Escorza and Bueno-García (1994) studied the scoring patterns of two additional boards that assessed eight examinations. They detected insignificant and minimal differences in the proportion of students who pass the entire examination process and concluded that the correction patterns were quite similar among the boards. In another quasi-experimental study, Watts and García-Carbonell (1999) selected two random boards of four raters to detect possible differences in assessment patterns in the English examination, depending on the use of holistic or focused-holistic correction criteria. The results confirmed higher levels of inter-rater reliability and accuracy when using focused-holistic criteria.
Beyond variable comparison, statistical procedures in EBAU were also considered. Gaviria (2005) explored various statistical techniques (classical, ordinary least squares, multilevel, average equalization, and standard deviation methods) to analyze the equivalence of baccalaureate grades with EBAU exam grades, with the latter being an anchoring point as it is a standardized test for all students. The results showed that non-classical methods produce better results than the classical weighting method and make for a fairer student selection process. To include modern statistical methods, such as those applied in the international context, Veas et al. (2020) investigated the construct comparability approach (Coe, 2008, 2010) through the Rasch model to compare both fit and difficulty levels among subjects from the 2018 EBAU exam sitting in the Valencian Community. The results showed adequate difficulty levels across the student ability continuum, though a lack of discrimination between high and low ability students was detected. Important conclusions are addressed, such as the influence of the difficulty of the subject selected by students in the examination process and the need to review score categories across subjects to obtain more reliable scores.
In the international context, the possible influence of raters has also been explored (Bejar, 2012; Congdon & McQueen, 2000; European Commission/EACEA/Eurydice, 2009). For assessment procedures implemented at various educational levels, most studies reported rater effects on the use of rubrics and assessment criteria. Jonsson and Svingby (2007) claimed that analytic rather than holistic rubrics are better in this type of assessment. Moreover, Holmes et al. (2018) showed that the pairwise comparison method had high levels of predictions of difficulty in mathematics tests, confirming the efficacy of methods such as the Rasch model. In this vein, the scientific literature confirms the need to include rater effects in statistical modeling to ensure the reliability and consistency of cut-off scores (Benton & Elliot, 2016). The use of predictions between cohorts of students examined in different years to ensure reliable standards in official examinations is most common in the UK (Benton & Sutch, 2014; Leckie & Baird, 2011).
1.2. The Present Study
Assessment presents a complex structure in the design process at all educational levels. According to Mislevy et al. (2003), the real challenge of all systems is to provide a framework supported by cognitive psychology, measurement models, information technology, and learning in disciplines. However, there are several differences in standard methods applied in various countries and even in inferences or arguments used to confirm the assessment structure used (Baird et al., 2024; Ehren, 2023). The present study tries to bridge this gap by extending advanced measurement models into EBAU, one of the main types of examinations in Spain. Indeed, analyses of raters or examination boards have received some attention in the last few decades; however, in spite of their importance for measuring grading quality, these studies used small samples of examinations and boards in local areas. Evidence shows that the level of demand and difficulty among different exam boards should be controlled by statistical evidence gathered from a range of sources (Baird et al., 2000; Robinson, 2007). Therefore, this study has two main aims: (a) to test the severity and leniency levels of examination boards in an official EBAU exam sitting in a region of Spain; (b) to explore severity and leniency levels applied to each subject for each participant university.
Despite having no previous studies to determine clear directional hypotheses, previous difficulty-level analyses with the Rating Scale Rasch Model (RSRM, Andrich, 1978) showed adequate difficulty levels in the analysis of subjects. It is, thus, possible to obtain positive results, considering the specific and public assessment rubrics used by examination boards.
This study is structured to present a detailed analysis of assessment consistency in the EBAU system. The subsequent section outlines the research methodology, which employs the Many-Facet Rasch model to examine scoring variances across multiple examination boards, addressing potential rater effects and the complexities of the evaluation process. The findings section then conveys the results regarding the score disparities, supporting the need for standardized assessment practices to promote fairness. Finally, the discussion section reflects on the implications of these findings for the educational policy, with recommendations aimed at fostering inter-subject comparability and enhancing transparency in the assessment system.
2. Materials and Methods
2.1. Participants
We used a subsample from a large-scale study based on a national research project, which aimed to explore the measurement properties of EBAU examinations in Spain. In our study, 3000 participants were randomly selected from 19,690 students in the June 2018 exam sitting, with an equal proportion of students recruited from three provinces (Alicante, Valencia, and Castellón), for estimating parameters; this was the only region from which the rating score system could be analyzed. A total of 54 examination boards were used for data analyses. Individual raters could not be considered because, to ensure anonymization, the local government only provided data from general boards. Consequently, only the total number of boards was computed, together with the total number of students assigned, the province, and the university. The frequency distributions of boards and students can be seen in Table 1 and Table 2, respectively.
2.2. Instruments
This study used a selected group of EBAU examinations administered in the Valencian Community, which was the same for all three provinces (Alicante, Valencia, and Castellon). The selection criterion was a minimum of 600 students tested per academic subject. This was used to ensure the greater precision of the estimated parameters (He et al., 2018). The academic subjects selected were Castilian Language and Literature, History of Spain, English Language, Mathematics II, Applied Mathematics for Social Sciences II, and Valencian Language.
2.3. Procedure
Permission was first obtained from the University Regulation Service (SIIU), the formal institution belonging to the Autonomous Government of Valencia, which provided grades from all students enrolled. For the present research, data from June 2018 were taken for analysis, considering 54 examination boards across the three provinces. Table 2 shows the number of students enrolled per examination board, as well as the provinces and cities where examinations were held. Data were totally anonymized and registered at SIIU, and no previous informed consent was needed. Panel boards are coordinated by an EBAU commission, which comprises representatives from the Department of Education (e.g., the chief of the educational assessment section), high school directors, educational inspectors, and both university and high school professors.
Panel boards have two main functions: (1) to create the questions for the specific subject’s official EBAU call and (2) to elaborate and justify the standard correction criteria and the marking score distributed for each question. Boards are assigned to a concrete local area and a university.
2.4. Data Analysis
For the present study, the Many-Facet Rasch Model (MFRM, Linacre, 1993) was used as an extension of the Rasch measurement model (Rasch, 1960). The MFRM has important advantages over classical data analysis. First, all facets involved in analysis (examination board severity, subject difficulty, and group performance) are located on a common underlying linear scale. This results in a measure that can be subjected to traditional statistical analysis, while allowing for unambiguous interpretation of group performance as it relates to board severity and subject (or item) difficulty. Second, calibration does not depend on the sample used. This means that the Rasch technique removes the influence of sampling variability from its measures so that valid generalizations can be made. In our specific EBAU analysis, this was important for dealing with missing data, as not all students were enrolled in the same subjects, and not all boards rated the same students. Third, fit procedures exist to derive unexpected response patterns useful for evaluating the extent to which individual groups, tasks, or raters (in our case, examination boards) behave in ways inconsistent with the measurement model (Engelhard, 1992).
The MFRM allows users to create a single interval of scores relevant to both the difficulty of subjects and the ability of the persons tested. These scores are usually reported in logits due to their easy arithmetic properties in terms of interpretations. In this study, the MFRM was deemed a more effective statistical procedure since multiple examination boards assess different groups of students based on standard rubrics for each subject. Moreover, the MFRM does not require assumptions about sampling or the normality of distributions; thus, it is useful for performance assessment with different subject structures at any educational level, even higher education. The formulation of the facet model to perform the analysis in this study was conducted as follows:
where rater is the university, item is the subject, and step is the grade obtained in the subject. The parameter estimation was performed using the marginal maximum likelihood method.
For this analysis, the adequacy of scores was checked with values of examination boards’ mean square residuals (MNSQs), following criteria from Eckes (2005) and Wolfe and McVay (2012). Estimators of the severity of each board, together with estimating the mean score severity for each subject, were used to determine the adequacy of scores.
Unweighted mean square and weighted mean square statistics were used to check rater severity levels from a Rasch measurement perspective. Weighted mean square statistics give more importance to boards aligned with the ability level of subjects. More weight is provided to these subjects as they can carry more information about board ability. The weighted mean square was defined as follows:
where is the rating obtained for the student, is the probability of obtaining that rating based on the proposed measurement model, and .
In contrast, unweighted mean square statistics give more importance to boards that rated subjects positively far above their ability level (Bond & Fox, 2007). The unweighted mean square was defined as follows:
where N is the sample size, and all other terms are interpreted as in WMS.
Values of unweighted and weighted mean squares (MNSQs) can range from 0 to positive infinity. Values below 1 indicate a better fit than expected using the model, whereas values above 1 indicate a poorer fit than expected using the model. In this case, the appropriate range for fit statistics was 0.5–1.5 (Wright & Linacre, 1994). These fit statistics could be approximately normalized using the Wilson–Hilferty transformation with t-values. The Wilson–Hilferty t-transformation has the following form:
where MS is the mean square weighted or unweighted value, and σ is the standard deviation of the mean square. This transformation follows a normal distribution, with a mean of 0 and a standard deviation of 1, where values within a range from −2 to +2 are considered appropriate, although this statistic may not follow a normal distribution in high sample sizes. Conquest version 2.0 (Wu et al., 2007) was used to conduct analyses in this study.
3. Results
Rater effects were assessed by examining the severity level of each board (Table 3). All estimations had adequate unweighted and weighted fit indices, indicating the good fit of the data to the model. Important heterogeneity levels were detected among boards within and between universities. The most severe was board 5-3, with 0.725 logits, followed by board 4-8, with 0.658 logits.
Table 4 shows the total severity levels of boards for each subject. Applied Mathematics for Social Sciences (MCS) had the highest severity score, with 0.611 logits, followed by Mathematics (MAT), with −0.015; Castilian Language and Literature (CAS), with −0.099; English Language (ENG), with −0.675; Valencian Language and Literature (VAL), with -0.736; and History of Spain (HES), with −1.070 logits. Students enrolled in experimental sciences have higher severity scores compared to those enrolled in humanities or social sciences.
These differences are more extreme when comparing the severity mean scores of subjects across universities (see Table 5 and Figure 1). An example is the Valencian Language, as the severity mean score was 0.022 for the University of Valencia and 0.734 for the University Miguel-Hernández (a difference of 0.712 logits). There are no subjects with a similar pattern of severity mean scores across the five universities. Thus, students clearly benefit or are punished depending on the board to which they have been assigned and even the province where they are located. The Wright map confirms the visualization of these differences in the latent scale when considering the interaction of university and examinations (see Figure 1). This figure shows the distribution of students on the left axis of the figure from less ability (bottom) to more ability (top). The results of the estimated ability parameters suggest a normal distribution. Since, in the many-facet model, ability and difficulty parameters are on the same continuum, universities, subjects, and boards appear on adjacent axes according to the corresponding difficulty. The estimates of the difficulty parameters at the bottom of the figure indicate less university or board difficulty, while those at the top of the figure indicate more university or board difficulty.
4. Discussion
The present study explores the adequacy of the consistency scoring system of the examination boards of EBAU, one of the main examinations in Spain. Identifying and describing a consistent framework for monitoring rater performance is essential for analyzing the use of standards (Myford & Wolfe, 2009) and for ensuring student equity.
The results suggest a lack of homogeneity in the rating score system, even with the clear published standards that should be applied by all examination boards. According to Engelhard and Wind (2018), these results affect two main principles applied to rater-mediated assessment: the rater-invariant measurement of persons (the measurement of persons must be independent of the particular raters that happen to be used in assessment) and the invariant locations of raters (locations of raters must be independent of the particular persons, cues, and rating scales used in assessment). Therefore, students can benefit or be punished depending on their location or subject. This might also affect subjects selected by students, based on their future academic choice (knowledge field of the undergraduate degree).
Given the nature of EBAU, the diversity of the format of rating used across subjects may have some impact on the scoring consistency of examination boards. Some subjects, such as English Language, have multiple-choice and short-answer items, and they are given a more objective correction system where success or failure has a direct score. However, in others, such as History of Spain, students must write an elaborate comment on a text or respond to broader questions. These types of written compositions are related to expanded rubrics, where the range of ratings can determine more differences among raters to accurately measure the construct. In this regard, we believe that Spanish national and local educational institutions should provide more consistent guidance for high-stakes examinations. It would be necessary to collect national data from individual raters to obtain more reliable results concerning rating accuracy. Moreover, analyzing rating performance over time would provide relevant information, for instance, to choose strategies to detect and measure changes in rater behavior or to select a frame of reference to depict drift (Myford & Wolfe, 2009).
A further important issue is student equality regarding university access in Spain. The Spanish autonomous communities create their own exams with minimum standards approved by the Spanish government; nevertheless, students may apply to any public university nationwide regardless of the region they belong to. A lack of comparative analyses of the rating score tendency of the autonomous communities constitutes a serious threat to the rating process. The present study may serve as a starting point to determine future replications for ensuring consistent and generalized rating strategies. Regarding examination entry policy, as students may receive a different impact on their achievement according to rating severity, the number and type of subjects chosen may vary across autonomous communities. If so, this strategy overlaps with the need to choose more exams to achieve a better access grade, affecting current achievement levels (Newton, 1997).
Given the importance of an accurate grading system in the EBAUs and the results obtained, the authors of this manuscript propose some organizational and methodological strategies that can be effective in the future. In the first place, it would be necessary to systematically determine the same analytical level of correction criteria across examinations, as it seems to help ensure grading objectivity and reduce error deviations (Watts & García-Carbonell, 1999). As this procedure has been more studied in the field of English Language (Barkaoui, 2010; Yan & Chuang, 2023), it can be extended to other examinations that involve replies to similarly formatted questions that balance the holistic and analytic approaches (Tomas et al., 2019). Secondly, it is necessary to assess the reliability of the grading process. To this end, the leniency or severity levels of boards can be examined across consecutive years to check standard stability over time. This measure would lead to a more realistic comparison of regions or examination districts.
Finally, it is important to mention some limitations. First, only data from one region of Spain (Valencia) were used to analyze rater effects due to formal restrictions or difficulties in accessing data in other regions. Additionally, only numbered boards were registered, and not individual raters. Autonomous regions in Spain do not have a mandatory protocol to register individual raters, as, normally, this type of data is not required to be collected from the University Information System. We would like this study to be a starting point for checking and elaborating new data collection protocols for all the regions of the country to ensure the general equivalence of standards among exam boards regarding achievement criteria. Additionally, this study does not include data from different cohorts across time. The research literature has shown that the behavior of individual raters may change over time, and that DRIFT-differential rater functioning (Wolfe & Moulder, 1999) is ultimately a necessary measurement process for detecting differential accuracy and scale category use (Myford & Wolfe, 2009; Wolfe et al., 2001). We hope that this type of analysis can also be implemented in the future with extended samples and individual raters from different regions of Spain.
In conclusion, this study serves as a critical foundation for contemplating the inherent shortcomings of the EBAU system, particularly concerning equitable access. Political and social debate have discussed the negative consequences of an unequal selection of students in tertiary education (Faura-Martínez et al., 2022; Lamprianou, 2009). Therefore, it is strongly necessary that national policy prioritizes coordinated efforts among educational institutions and local authorities to develop effective strategies aimed at enhancing inter-subject comparability in assessments.
A comprehensive exploration and comparison of various assessment methodologies is necessary to foster greater uniformity in the difficulty levels of examinations, especially in the area of assessment rater effects (Guo & Wind, 2021; Zabala-Delgado, 2021). Such an approach would ensure that the implementation of the unique district procedure remains transparent, equitable, and ultimately conducive to a fairer selection process according to the promotion of social justice in the international context (Volante et al., 2024). Moreover, effective educational reform is contingent upon active participation by stakeholders in the policy development process, ensuring that reforms are responsive to actual educational needs and challenges (Poskitt, 2023). National policy reform would acknowledge the urgent need for systemic changes that address these critical issues, thereby reaffirming the commitment to equity and excellence in education across all autonomous communities in Spain.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1Andrich D. A rating formulation for ordered response categories Psychometrika 19784356157310.1007/BF 02293814 · doi ↗
- 2Baird J.-A. Cresswell M. Newton P. Would the real gold standard please step forward?Research Papers in Education 200015221322910.1080/026715200402506 · doi ↗
- 3Baird J.-A. Godfrey-Faussett T. Allan S. Macintosh E. Hutchinson C. Wiseman-Orr L. Standards as a social contract in curriculum-based qualifications: Stakeholder views in Scotland Cambridge Journal of Education 202454445547410.1080/0305764 X.2024.2377965 · doi ↗
- 4Barkaoui K. Variability in ESL essay rating processes: The role of the rating scale and rater experience Language Assessment Quarterly 201071547410.1080/15434300903464418 · doi ↗
- 5Bejar I. I. Rater cognition: Implications for validity Educational Measurement 20123132910.1111/j.1745-3992.2012.00238.x · doi ↗
- 6Benton T. Elliot G. The reliability of setting grade boundaries using comparative judgement Research Papers in Education 201631335237610.1080/02671522.2015.1027723 · doi ↗
- 7Benton T. Sutch T. Analysis of the use of Key Stage 2 Data in GCSE predictions Published Report Ofqual 2014
- 8Blyth K. Selection methods for undergraduate admissions in Australia: Does the Australian predominate entry scheme the Australian Tertiary admission Rank (ATAR) have a future?Journal of Higher Education Policy and Management 20143626827810.1080/01587919.2014.899049 · doi ↗
