Enabling Classifiers to Make Judgements Explicitly Aligned with Human Values
Yejin Bang, Tiezheng Yu, Andrea Madotto, Zhaojiang Lin, Mona Diab, Pascale Fung

TL;DR
This paper introduces a framework for creating classifiers that explicitly incorporate human values, using large language models to generate training data, resulting in improved performance and greater inclusivity and explainability.
Contribution
The paper presents a novel value-aligned classification framework that distills value-aligned knowledge from large language models into smaller classifiers, which predict based on an explicitly written human value.
Findings
VA-Models outperform multiple baselines by at least 15.56% on the F1-score
Generated data from LLMs improves classifier performance
Explicit human value input enhances AI inclusivity and explainability
Abstract
Many NLP classification tasks, such as sexism/racism detection or toxicity detection, are based on human values. Yet, human values can vary under diverse cultural conditions. Therefore, we introduce a framework for value-aligned classification that performs prediction based on explicitly written human values in the command. Along with the task, we propose a practical approach that distills value-aligned knowledge from large-scale language models (LLMs) to construct value-aligned classifiers in two steps. First, we generate value-aligned training data from LLMs by prompt-based few-shot learning. Next, we fine-tune smaller classification models with the generated data for the task. Empirical results show that our VA-Models surpass multiple baselines by at least 15.56% on the F1-score, including few-shot learning with OPT-175B and existing text augmentation methods. We suggest that using…
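The two steps described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the prompt layout, the label names ("aligned" / "not aligned"), the `[SEP]` separator, and all function names are assumptions made for this example, and `fake_llm` is a stand-in for a real LLM completion call.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Example:
    value: str  # the explicitly written human value
    text: str   # the text to be judged against that value
    label: str  # hypothetical label set: "aligned" / "not aligned"


def build_prompt(value: str, label: str, demos: List[Example]) -> str:
    """Step 1a: few-shot prompt asking an LLM for a text with `label` under `value`."""
    blocks = [f"Value: {d.value}\nLabel: {d.label}\nText: {d.text}\n" for d in demos]
    blocks.append(f"Value: {value}\nLabel: {label}\nText:")
    return "\n".join(blocks)


def generate_examples(llm: Callable[[str], str], value: str, labels: List[str],
                      demos: List[Example], n_per_label: int) -> List[Example]:
    """Step 1b: collect LLM completions as synthetic value-aligned training data."""
    out = []
    for label in labels:
        prompt = build_prompt(value, label, demos)
        for _ in range(n_per_label):
            text = llm(prompt).strip()
            if text:  # drop empty generations
                out.append(Example(value, text, label))
    return out


def to_classifier_input(value: str, text: str, sep: str = " [SEP] ") -> str:
    """Step 2 (input side): pair value and text so a smaller model sees both."""
    return f"{value}{sep}{text}"


def fake_llm(prompt: str) -> str:
    # Placeholder for a real LLM call; returns a fixed string for illustration.
    return "  a placeholder generated sentence  "


VALUE = "Everyone should be treated with respect regardless of gender."
demos = [
    Example(VALUE, "Women make excellent engineers.", "aligned"),
    Example(VALUE, "Women should not be allowed to vote.", "not aligned"),
]
data = generate_examples(fake_llm, VALUE, ["aligned", "not aligned"], demos, n_per_label=2)
inputs = [to_classifier_input(ex.value, ex.text) for ex in data]
```

The resulting `inputs` and the labels in `data` would then be used to fine-tune a smaller classification model with whatever training library is at hand; that part is omitted here because the abstract does not specify the model or training setup.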
Taxonomy
Topics: Hate Speech and Cyberbullying Detection
