Quantifying construct validity in large language model evaluations