Concept Training for Human-Aligned Language Models
Christine Zhang, Dan Jurafsky, Chen Shani

TL;DR
This paper proposes a concept-based training framework in which language models predict sets of semantically related tokens instead of a single token. The approach improves semantic alignment with human judgments while keeping performance competitive.
Contribution
It introduces concept supervision, a training approach that improves semantic understanding and alignment in language models relative to standard next-token prediction training.
Findings
Models trained with concept supervision align better with human semantic similarity judgments.
Concept training results in lower perplexity on semantically meaningful words.
The tradeoff is a modest increase in global token-level perplexity.
Abstract
The next-token prediction (NTP) objective trains language models to predict a single continuation token at each step. In natural language, however, a prefix can be continued in many valid ways, and even similar meanings may differ in surface form. For example, the prefix "this website is safe to" could plausibly continue with words such as browse, search, visit, surf, or navigate. While standard NTP training treats these alternatives as mutually exclusive targets, we explore a framework that instead predicts concepts, approximated as sets of semantically related tokens. We show that models trained with concept supervision exhibit stronger alignment with human semantic similarity judgments on multiple lexical benchmarks. These gains are accompanied by lower perplexity on semantically meaningful words (definition in Section 3.1), and a modest increase in global…
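The contrast between a single-token target and a set-valued target can be illustrated with a small sketch. The excerpt above does not state the exact training loss, so the version below rests on an assumption: the concept loss is taken to be the negative log of the total probability the model assigns to all tokens in the concept set. The function names, the toy vocabulary, and the logit values are hypothetical and chosen only for illustration.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of raw scores."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def ntp_loss(logits, target_id):
    """Standard next-token loss: negative log-probability of one token."""
    return -log_softmax(logits)[target_id]

def concept_loss(logits, concept_ids):
    """Assumed concept loss: negative log of the summed probability
    of every token in the concept set (not confirmed by the source)."""
    log_probs = log_softmax(logits)
    selected = [log_probs[i] for i in set(concept_ids)]
    m = max(selected)
    # log-sum-exp over the selected log-probabilities
    return -(m + math.log(sum(math.exp(lp - m) for lp in selected)))

# Toy vocabulary and made-up logits for the prefix "this website is safe to"
vocab = ["browse", "search", "visit", "surf", "navigate", "eat"]
logits = [2.0, 1.5, 1.8, 0.5, 1.0, -1.0]
concept = [vocab.index(w) for w in ["browse", "search", "visit", "surf", "navigate"]]

single = ntp_loss(logits, vocab.index("browse"))
grouped = concept_loss(logits, concept)
print(f"NTP loss on 'browse': {single:.3f}")
print(f"Concept loss on the set: {grouped:.3f}")
```

Under this assumption, a model that spreads probability across browse, search, visit, surf, and navigate is not penalized for failing to pick the one reference token, so the concept loss is never larger than the single-token loss for any token in the set.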
