Multi-Domain Explainability of Preferences
Nitay Calderon, Liat Ein-Dor, Roi Reichart

TL;DR
This paper introduces a fully automated method for explaining preference mechanisms (human preference, LLM-as-a-Judge, and reward models) across multiple domains. It combines LLM-identified concept-based explanations with a white-box hierarchical regression model to both explain and predict preferences.
Contribution
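The paper presents a fully automated approach for generating local and global concept-based explanations of preferences across multiple domains. It models the relationships between concepts and preferences with a white-box Hierarchical Multi-Domain Regression model that captures both domain-general and domain-specific effects.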
Findings
Outperforms baselines in preference prediction while remaining explainable
Guiding LLM outputs with concepts from LLM-as-a-Judge (LaaJ) explanations yields responses that those judges prefer
Prompting LaaJs with concepts that explain human preferences improves their preference predictions
Abstract
Preference mechanisms, such as human preference, LLM-as-a-Judge (LaaJ), and reward models, are central to aligning and evaluating large language models (LLMs). Yet, the underlying concepts that drive these preferences remain poorly understood. In this work, we propose a fully automated method for generating local and global concept-based explanations of preferences across multiple domains. Our method utilizes an LLM to identify concepts that distinguish between chosen and rejected responses, and to represent them with concept-based vectors. To model the relationships between concepts and preferences, we propose a white-box Hierarchical Multi-Domain Regression model that captures both domain-general and domain-specific effects. To evaluate our method, we curate a dataset spanning eight challenging and diverse domains and explain twelve mechanisms. Our method achieves strong preference…
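The abstract describes a white-box regression model that captures both domain-general and domain-specific effects, but this page does not give its exact formulation. One common way to realize such a structure, shown here only as an illustrative sketch and not as the authors' implementation, is a logistic regression in which each concept's weight is the sum of a shared term and a domain-specific offset, with a stronger penalty on the offsets. All names (fit_hierarchical_logreg, predict_proba, w_shared, w_domain), the penalty values, and the plain gradient-descent optimizer are assumptions made for the example.

```python
import numpy as np


def predict_proba(X, domains, w_shared, w_domain):
    """Probability that the first response is preferred.

    X: (n, k) concept vectors, e.g. per-concept score differences
       between two responses (an assumed encoding).
    domains: (n,) integer domain id of each example.
    """
    # Effective weight of each concept = shared effect + domain offset.
    w_eff = w_shared + w_domain[domains]          # shape (n, k)
    logits = np.sum(X * w_eff, axis=1)
    return 1.0 / (1.0 + np.exp(-logits))


def fit_hierarchical_logreg(X, y, domains, n_domains, lr=0.5,
                            l2_shared=0.01, l2_domain=0.1, epochs=500):
    """Fit shared and per-domain concept weights by gradient descent.

    The larger penalty on the domain offsets (l2_domain) nudges the model
    to explain data with shared weights unless a domain clearly deviates.
    """
    n, k = X.shape
    w_shared = np.zeros(k)
    w_domain = np.zeros((n_domains, k))
    for _ in range(epochs):
        p = predict_proba(X, domains, w_shared, w_domain)
        err = (p - y)[:, None] * X                # per-example gradient, (n, k)
        g_shared = err.mean(axis=0) + l2_shared * w_shared
        g_domain = np.zeros_like(w_domain)
        np.add.at(g_domain, domains, err)         # sum gradients per domain
        g_domain = g_domain / n + l2_domain * w_domain
        w_shared -= lr * g_shared
        w_domain -= lr * g_domain
    return w_shared, w_domain
```

Because the model is linear in the concept vectors, the fitted values can be read directly: w_shared[j] is the effect of concept j across all domains, and w_domain[d, j] is how much domain d departs from that shared effect.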
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Recommender Systems and Techniques · Multimodal Machine Learning Applications
