Training data generation for context-dependent rubric-based short answer grading

Pavel Šindelář; Dávid Slivka; Christopher Bouma; Filip Prášil; Ondřej Bojar

arXiv:2603.28537 · cs.CL · April 1, 2026

TL;DR

This paper presents methods for generating large-scale training data for context-dependent rubric-based short answer grading, using only a relatively small confidential dataset as a reference and a set of very simple derived text formats to preserve confidentiality.

Contribution

The paper explores a small number of methods for creating surrogate datasets from a small confidential reference dataset, with the aim of supporting the training and tuning of automatic answer grading models.

Findings

01

Successfully created three surrogate datasets more similar to the reference data.

02

Early experiments indicate potential improvements in automatic answer grading.

03

Methods preserve confidentiality while enabling large-scale data generation.

Abstract

Every four years, the PISA test is administered by the OECD to test the knowledge of teenage students worldwide and allow for comparisons of educational systems. However, having to avoid language differences and annotator bias makes the grading of student answers challenging. For these reasons, it would be interesting to consider methods of automatic student answer grading. To train some of these methods, which require machine learning, or to compute parameters or select hyperparameters for those that do not, a large amount of domain-specific data is needed. In this work, we explore a small number of methods for creating a large-scale training dataset using only a relatively small confidential dataset as a reference, leveraging a set of very simple derived text formats to preserve confidentiality. Using the proposed methods, we successfully created three surrogate datasets that are, at…
