Latent Principle Discovery for Language Model Self-Improvement
Keshav Ramji, Tahira Naseem, Ramón Fernandez Astudillo

TL;DR
This paper introduces a method for language models to self-improve by automatically discovering and applying latent behavioral principles through a self-correction framework, leading to enhanced response quality.
Contribution
It presents a novel approach that mines, compresses, and teaches latent principles within language models to enable automated self-improvement without extensive manual annotation.
Findings
Achieved +8-10% win-rate improvement on AlpacaEval
Improved MT-Bench scores by +0.3 on average
Increased principle-following win-rate by +19-23% on IFEval
Abstract
When users of a language model (LM) aim to improve the quality of its generations, it is crucial to specify concrete behavioral attributes that the model should strive to reflect. However, curating such principles across many domains, even non-exhaustively, requires a labor-intensive annotation process. To automate this process, we propose eliciting these latent attributes that guide model reasoning toward human-preferred responses by explicitly modeling them in a self-correction setting. Our approach mines new principles from the LM itself and compresses the discovered elements to an interpretable set via clustering. Specifically, we employ a form of posterior-regularized Monte Carlo Expectation-Maximization to both identify a condensed set of the most effective latent principles and teach the LM to strategically invoke them in order to intrinsically refine its responses. We demonstrate…
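The compression step mentioned in the abstract (reducing many mined principles to a small interpretable set via clustering) can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not say which text representation or clustering algorithm is used, so the word-overlap (Jaccard) similarity, the greedy assignment rule, the `threshold` value, the function names, and the example principles below are all assumptions made for illustration.

```python
import re


def tokens(text):
    """Return the set of lowercased words in a principle string."""
    return set(re.findall(r"[a-z]+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity between two token sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def compress_principles(principles, threshold=0.5):
    """Greedy clustering (hypothetical stand-in for the paper's clustering).

    Each principle joins the first cluster whose representative (its first
    member) is at least `threshold` similar; otherwise it starts a new cluster.
    Returns one (representative, cluster_size) pair per cluster.
    """
    clusters = []  # list of lists of principle strings
    for p in principles:
        for cluster in clusters:
            if jaccard(tokens(p), tokens(cluster[0])) >= threshold:
                cluster.append(p)
                break
        else:
            clusters.append([p])
    return [(c[0], len(c)) for c in clusters]


# Made-up principles standing in for ones mined from an LM.
mined = [
    "Be concise and direct",
    "Be direct and concise",
    "Cite sources for factual claims",
    "Cite the sources for factual claims",
    "Use a friendly tone",
]
compressed = compress_principles(mined)
for representative, size in compressed:
    print(f"{size} x {representative}")
```

In a realistic setting one would likely replace the word-overlap similarity with sentence embeddings and a standard clustering algorithm; the point here is only that near-duplicate principles collapse into a single representative.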
Taxonomy
Topics: Topic Modeling · Explainable Artificial Intelligence (XAI) · Multimodal Machine Learning Applications
Methods: Sparse Evolutionary Training
