Aligning NLP Models with Target Population Perspectives using PAIR: Population-Aligned Instance Replication

Stephanie Eckman; Bolei Ma; Christoph Kern; Rob Chew; Barbara Plank; Frauke Kreuter

arXiv:2501.06826 · stat.ME · August 27, 2025

TL;DR

This paper introduces PAIR, a post-processing method that adjusts training data to better reflect the target population's perspectives, improving model calibration without needing extra annotations.

Contribution

The paper presents PAIR, a novel data replication technique that enhances population representativity in NLP training data without additional annotation collection.

Findings

01 Non-representative annotator pools harm model calibration

02 PAIR effectively improves calibration by replicating underrepresented group annotations

03 Accuracy remains largely unaffected by the replication process

Abstract

Models trained on crowdsourced annotations may not reflect population views if those who work as annotators do not represent the broader population. In this paper, we propose PAIR: Population-Aligned Instance Replication, a post-processing method that adjusts training data to better reflect target population characteristics without collecting additional annotations. Using simulation studies on offensive language and hate speech detection with varying annotator compositions, we show that non-representative pools degrade model calibration while leaving accuracy largely unchanged. PAIR corrects these calibration problems by replicating annotations from underrepresented annotator groups to match population proportions. We conclude with recommendations for improving the representativity of training data and model performance.
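
The replication step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (see the soda-lmu/PAIR repository for that); the function name `pair_replicate`, the `group` field, and the rounding rule are all assumptions made for illustration. It assumes each annotation carries a group label and that the target population shares are known.

```python
from collections import Counter


def pair_replicate(annotations, population_share, group_key="group"):
    """Replicate annotations so group shares approximate population_share.

    Hypothetical helper for illustration only; not taken from the paper's code.
    annotations: list of dicts, each with a group label under group_key.
    population_share: dict mapping group label -> target share (sums to 1).
    """
    counts = Counter(a[group_key] for a in annotations)
    n = len(annotations)

    # Weight per group = target population share / observed sample share.
    weights = {g: population_share[g] * n / counts[g] for g in counts}

    # Normalize so the most overrepresented group keeps one copy per annotation;
    # underrepresented groups then receive proportionally more copies.
    min_weight = min(weights.values())

    replicated = []
    for a in annotations:
        # Rounding to whole copies is an assumption; it makes the match approximate.
        copies = round(weights[a[group_key]] / min_weight)
        replicated.extend([a] * copies)
    return replicated


# Toy pool: 8 annotations from group A, 2 from group B; target population is 50/50.
pool = [{"group": "A", "label": 0}] * 8 + [{"group": "B", "label": 1}] * 2
balanced = pair_replicate(pool, {"A": 0.5, "B": 0.5})
balanced_counts = Counter(a["group"] for a in balanced)
```

In this toy case each group B annotation is copied four times while group A annotations are kept once, so the replicated set holds eight annotations per group and matches the 50/50 target. No new annotations are collected; only existing ones are repeated.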

Code & Models

Repositories

soda-lmu/PAIR (PyTorch, official)

Taxonomy

Methods: ALIGN