# The stability of IRT parameters under several test equating conditions

**Authors:** Dominik Weber, Nicolas Becker, Frank M. Spinath, Marco Koch

PMC · DOI: 10.3389/fpsyg.2025.1652341 · Frontiers in Psychology · 2026-01-12

## TL;DR

This simulation study examines how test equating conditions affect the recovery of IRT parameters. Low guessing probabilities, highly discriminating anchor items, and sufficiently large samples matter most, while the choice of equating method has only minor effects.

## Contribution

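Earlier studies typically compared only absolute numbers of anchor items or focused on a single IRT model or equating method. This simulation jointly varies the equating method, IRT model, guessing probability, anchor item proportion, test set size, anchor item discrimination, and sample size, and it evaluates correlations as well as recovery indices between true and estimated parameters.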

## Key findings

- Low guessing probabilities and high anchor item discrimination strongly improve test equating quality for all three IRT parameters (discrimination, difficulty, and ability).
- Recovery of discrimination and ability parameters increases logarithmically with larger test set sizes and higher anchor item proportions, and each factor partially compensates for reductions in the other.
- Samples below 100 individuals yield inadequate parameter recovery. Samples of 100 or 500 are justifiable under certain conditions, although samples of only 100 carry a slight risk of non-convergence.

## Abstract

It is crucial for researchers and test developers to compare results from different test sets (e.g., re-testing, parallel test forms). To ensure comparability, test sets are often linked using anchor items as a common denominator alongside distinct items. To date, most studies on test equating have been limited in scope, typically comparing only absolute numbers of anchor items or focusing on a single IRT model or equating method. Furthermore, previous research has primarily evaluated the absolute deviation of estimated parameters from true parameters. However, in diagnostic contexts, the correlation between these values is often more relevant for ensuring validity and test fairness. Therefore, the aim of this simulation study was to examine the impact of a broad range of key factors on test equating.
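The difference between absolute deviation and correlation can be illustrated with a small sketch. All values below are synthetic, and the scale shift (slope 1.5, intercept 0.8) is a hypothetical example rather than anything reported in the study. The point is only that a linear misalignment of the ability scale leaves the Pearson correlation with the true values unchanged while inflating the root mean squared error.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic "true" abilities and noisy estimates (illustrative values only).
theta_true = rng.normal(0.0, 1.0, size=500)
theta_est = theta_true + rng.normal(0.0, 0.3, size=500)

# The same estimates expressed on a linearly misaligned scale
# (hypothetical slope and intercept).
theta_shifted = 1.5 * theta_est + 0.8


def rmse(est, true):
    """Root mean squared deviation of estimates from true values."""
    return float(np.sqrt(np.mean((est - true) ** 2)))


def pearson(est, true):
    """Pearson correlation between estimates and true values."""
    return float(np.corrcoef(est, true)[0, 1])


print("aligned:    r =", round(pearson(theta_est, theta_true), 3),
      " RMSE =", round(rmse(theta_est, theta_true), 3))
print("misaligned: r =", round(pearson(theta_shifted, theta_true), 3),
      " RMSE =", round(rmse(theta_shifted, theta_true), 3))
```

Both lines print the same correlation, whereas the RMSE of the misaligned estimates is clearly larger. The rank order of persons is therefore preserved even when the absolute deviations are not.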

We evaluated correlations and recovery indices between predefined true values and values estimated through test equating for three IRT parameters (discrimination, difficulty, and ability). To this end, we varied the equating method (MS, MM, MGM, IRF, TRF), the IRT model (2PL vs. 3PL), guessing probability (0.000–0.250), anchor item proportion (5–25%), test set size (20–80 items), and the discrimination parameters of the anchor items. In addition, we used samples of 25–100 individuals to assess equating quality under challenging conditions as well as samples of 500 and 1,000 individuals to reflect adequate modeling conditions.
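As a rough illustration of this kind of design, the sketch below simulates dichotomous responses under a 3PL model (a guessing parameter of zero gives the 2PL) and links two scales through anchor item difficulties. It assumes that MS denotes the mean-sigma method, which the abstract does not spell out. The function names, the parameter values, and the noise-free anchor setup are hypothetical and not taken from the study.

```python
import numpy as np


def p_correct(theta, a, b, c=0.0):
    """3PL probability of a correct response (persons x items); c=0 gives the 2PL."""
    theta = np.asarray(theta, dtype=float)[:, None]
    return c + (1.0 - c) / (1.0 + np.exp(-a * (theta - b)))


def mean_sigma(b_source, b_target):
    """Mean-sigma linking constants A, B such that b_target ~ A * b_source + B."""
    A = np.std(b_target, ddof=1) / np.std(b_source, ddof=1)
    B = np.mean(b_target) - A * np.mean(b_source)
    return A, B


rng = np.random.default_rng(0)

# Hypothetical design cell: 40 items, 25% anchor items, 500 persons, guessing 0.25.
n_items, n_anchor, n_persons = 40, 10, 500
a = rng.uniform(0.8, 2.0, size=n_items)      # discrimination
b = rng.normal(0.0, 1.0, size=n_items)       # difficulty
theta = rng.normal(0.0, 1.0, size=n_persons)  # ability

prob = p_correct(theta, a, b, c=0.25)
responses = (rng.random(prob.shape) < prob).astype(int)

# Anchor parameters expressed on a second (source) scale. They are kept
# noise-free here so that the linking constants are recovered exactly;
# in practice they would be estimates from a separate calibration.
A_true, B_true = 1.2, -0.5
b_anchor_target = b[:n_anchor]
a_anchor_target = a[:n_anchor]
b_anchor_source = (b_anchor_target - B_true) / A_true
a_anchor_source = a_anchor_target * A_true

A_hat, B_hat = mean_sigma(b_anchor_source, b_anchor_target)

# Transform source-scale parameters onto the target scale.
b_linked = A_hat * b_anchor_source + B_hat
a_linked = a_anchor_source / A_hat

print("A_hat =", round(float(A_hat), 3), " B_hat =", round(float(B_hat), 3))
```

With estimated rather than exact anchor parameters, `A_hat` and `B_hat` would deviate from the true constants. That deviation is the kind of effect that varying anchor proportion, anchor discrimination, and sample size is meant to probe.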

Low guessing probabilities and high anchor item discrimination parameters strongly improved test equating quality for all three IRT parameters. Recovery of discrimination and ability parameters increased logarithmically with larger test set sizes and higher anchor item proportions, with each of these two factors partially compensating for reductions in the other. While sample sizes below 100 individuals produced inadequate parameter recovery, samples of 100 or 500 individuals were justifiable under certain conditions. However, samples of only 100 individuals carried a slight risk of non-convergence. The choice of the equating method had rather minor effects and the impact of the IRT model was ambivalent.

These findings highlight the importance of using distractor-free response formats without any guessing probability, anchor items with high discrimination parameters, and large samples to ensure valid test equating. For individual research and test application purposes, we provide a comprehensive data set covering multiple factor levels and a step-by-step simulation guide.

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12832260/full.md

## Figures

2 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12832260/full.md

## References

40 references — full list in the complete paper: https://tomesphere.com/paper/PMC12832260/full.md

---
Source: https://tomesphere.com/paper/PMC12832260