Multi-Domain Explainability of Preferences

Nitay Calderon; Liat Ein-Dor; Roi Reichart

arXiv:2505.20088 · cs.CL · May 30, 2025

TL;DR

This paper introduces a fully automated, multi-domain method for explaining the preference mechanisms used to align and evaluate LLMs. It combines LLM-generated concept-based explanations with a white-box hierarchical regression model to explain and predict preferences.

Contribution

The paper presents an automated approach for generating local and global concept-based explanations across multiple domains. It models the relationships between concepts and preferences with a hierarchical regression model that captures both domain-general and domain-specific effects.

Findings

01 Outperforms baselines in preference prediction accuracy

02 Provides explainability that guides LLM outputs effectively

03 Enhances preference prediction by incorporating human and LLM-based explanations

Abstract

Preference mechanisms, such as human preference, LLM-as-a-Judge (LaaJ), and reward models, are central to aligning and evaluating large language models (LLMs). Yet, the underlying concepts that drive these preferences remain poorly understood. In this work, we propose a fully automated method for generating local and global concept-based explanations of preferences across multiple domains. Our method utilizes an LLM to identify concepts that distinguish between chosen and rejected responses, and to represent them with concept-based vectors. To model the relationships between concepts and preferences, we propose a white-box Hierarchical Multi-Domain Regression model that captures both domain-general and domain-specific effects. To evaluate our method, we curate a dataset spanning eight challenging and diverse domains and explain twelve mechanisms. Our method achieves strong preference…
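The abstract does not give the exact form of the Hierarchical Multi-Domain Regression model, so the following is only a minimal sketch of the general idea, under stated assumptions. It assumes each preference pair is already represented by a concept-based vector `x` (here, a hypothetical difference in concept scores between the two responses), and it fits a logistic model whose weights are the sum of a shared (domain-general) part and a penalized per-domain (domain-specific) deviation. The function names, the L2 penalty, the optimizer, and the toy data are all illustrative choices, not the authors' implementation.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_hierarchical(X, y, domains, n_domains, lr=0.5, epochs=300, l2_domain=0.01):
    """Fit p(y=1 | x, d) = sigmoid((w_shared + w_dom[d]) . x).

    Minimizes the mean negative log-likelihood plus
    (l2_domain / 2) * sum of squared domain deviations,
    using full-batch gradient descent. Only the deviations are penalized,
    so effects common to all domains are absorbed by w_shared.
    """
    k = len(X[0])
    n = len(X)
    w_shared = [0.0] * k
    w_dom = [[0.0] * k for _ in range(n_domains)]
    for _ in range(epochs):
        g_shared = [0.0] * k
        g_dom = [[0.0] * k for _ in range(n_domains)]
        for x, label, d in zip(X, y, domains):
            z = sum((w_shared[j] + w_dom[d][j]) * x[j] for j in range(k))
            err = sigmoid(z) - label
            for j in range(k):
                g_shared[j] += err * x[j] / n
                g_dom[d][j] += err * x[j] / n
        for j in range(k):
            w_shared[j] -= lr * g_shared[j]
            for d in range(n_domains):
                w_dom[d][j] -= lr * (g_dom[d][j] + l2_domain * w_dom[d][j])
    return w_shared, w_dom


def predict(w_shared, w_dom, x, d):
    # Probability that the first response is preferred in domain d.
    z = sum((w_shared[j] + w_dom[d][j]) * x[j] for j in range(len(x)))
    return sigmoid(z)


def make_toy_data(n_per_domain=300, seed=0):
    # Synthetic data (an assumption for illustration only):
    # concept 0 helps in both domains, concept 1 helps in domain 0
    # and hurts in domain 1.
    rng = random.Random(seed)
    true_w = {0: (2.0, 2.0), 1: (2.0, -2.0)}
    X, y, domains = [], [], []
    for d in (0, 1):
        for _ in range(n_per_domain):
            x = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
            z = true_w[d][0] * x[0] + true_w[d][1] * x[1]
            y.append(1 if rng.random() < sigmoid(z) else 0)
            X.append(x)
            domains.append(d)
    return X, y, domains


if __name__ == "__main__":
    X, y, domains = make_toy_data()
    w_shared, w_dom = fit_hierarchical(X, y, domains, n_domains=2)
    print("shared weights:", w_shared)
    for d, w in enumerate(w_dom):
        print("domain", d, "deviations:", w)
```

Because the model is linear in the concept vector, the fitted weights can be read directly: `w_shared[j]` is the effect of concept `j` across domains, and `w_shared[j] + w_dom[d][j]` is its effect within domain `d`. In this toy setup the second concept is expected to receive deviations of opposite sign in the two domains.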

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Recommender Systems and Techniques · Multimodal Machine Learning Applications