Phare: A Safety Probe for Large Language Models

Pierre Le Jeune; Benoît Malézieux; Weixuan Xiao; Matteo Dora

arXiv:2505.11365·cs.CY·May 27, 2025

TL;DR

Phare is a multilingual diagnostic framework that evaluates large language models along three safety dimensions (hallucination and reliability, social biases, and harmful content generation), exposing systematic vulnerabilities and pointing to concrete improvements.

Contribution

The paper introduces a safety probing framework that identifies specific failure modes in LLMs rather than ranking models on performance metrics alone, with the aim of improving safety and trustworthiness.

Findings

01. Systematic vulnerabilities across safety dimensions
02. Patterns of biases and harmful content generation
03. Insights for improving LLM robustness and alignment

Abstract

Ensuring the safety of large language models (LLMs) is critical for responsible deployment, yet existing evaluations often prioritize performance over identifying failure modes. We introduce Phare, a multilingual diagnostic framework to probe and evaluate LLM behavior across three critical dimensions: hallucination and reliability, social biases, and harmful content generation. Our evaluation of 17 state-of-the-art LLMs reveals patterns of systematic vulnerabilities across all safety dimensions, including sycophancy, prompt sensitivity, and stereotype reproduction. By highlighting these specific failure modes rather than simply ranking models, Phare provides researchers and practitioners with actionable insights to build more robust, aligned, and trustworthy language systems.

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 4 · Confidence 4

Strengths

(1) Comprehensive and transparent framework. Phare integrates hallucination, bias, and harmful-content evaluations into a single, modular pipeline with open data, clearly defined prompts, and reproducible scoring methods. The framework is well-documented and replicable, setting a strong standard for benchmark transparency.
(2) Rigorous statistical testing and validation. The analysis incorporates confidence intervals, significance testing with FDR correction, and human verification of LLM-j
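For readers unfamiliar with the FDR correction mentioned in this review, here is a minimal sketch of the Benjamini-Hochberg step-up procedure. This is an assumption for illustration: the review excerpt does not say which FDR procedure the paper uses, and the function name `benjamini_hochberg` and the p-values are hypothetical, not taken from the paper.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return one boolean per p-value: True if rejected at FDR level alpha.

    Illustrative sketch only; not the paper's implementation.
    """
    m = len(p_values)
    if m == 0:
        return []
    # Indices of the p-values, sorted from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k such that p_(k) <= (k / m) * alpha.
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_k = rank
    # Reject every hypothesis whose rank is at most max_k (step-up rule).
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_k:
            rejected[idx] = True
    return rejected


# Hypothetical p-values from five pairwise comparisons.
example_p_values = [0.001, 0.008, 0.039, 0.041, 0.20]
print(benjamini_hochberg(example_p_values))
```

With these made-up inputs only the two smallest p-values survive the correction, even though four of the five are below the uncorrected 0.05 threshold.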

Weaknesses

(1) Language aggregation and performance variation. It is unclear whether headline results are averaged across English, French, and Spanish or reported for English only. Aggregation can obscure meaningful cross-lingual differences, and the appendix suggests noticeable performance variation across languages, with English often strongest and other languages sometimes reversing rankings. Clarifying how results are aggregated and briefly discussing these cross-lingual trends would make the finding
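The aggregation concern raised above can be illustrated with a small sketch. All model names and scores below are hypothetical and are not results from the paper; the helpers `macro_average` and `ranking` are likewise made up for this example. The point is only that a per-language ranking can differ from the ranking on the averaged score.

```python
from statistics import mean

# Hypothetical per-language scores (higher is better); NOT from the paper.
scores = {
    "model_a": {"en": 0.90, "fr": 0.70, "es": 0.68},
    "model_b": {"en": 0.80, "fr": 0.78, "es": 0.76},
}


def macro_average(per_language):
    """Unweighted mean of a model's scores over all languages."""
    return mean(per_language.values())


def ranking(all_scores, language=None):
    """Rank models best-first, by one language or by macro-average if None."""
    if language is None:
        def key(model):
            return macro_average(all_scores[model])
    else:
        def key(model):
            return all_scores[model][language]
    return sorted(all_scores, key=key, reverse=True)


print("aggregate:", ranking(scores))
for lang in ("en", "fr", "es"):
    print(lang, ranking(scores, lang))
```

In this toy setup `model_b` leads on the aggregate while `model_a` leads in English, which is the kind of reversal that a single averaged headline number would hide.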

Reviewer 02 · Rating 4 · Confidence 2

Strengths

- The writing and figures are clear and easy to follow.
- Unlike many English-centric benchmarks (e.g., those derived primarily from Wikipedia), Phare intentionally draws from culture-specific sources to diversify prompts.
- Safety is assessed across multiple dimensions within one framework.

Weaknesses

- Experiments center on English, French, and Spanish (three Indo-European languages), providing narrow evidence for “multilingual” claims. Typologically distant and low-resource languages are not represented, limiting generalizability.
- The manuscript emphasizes multilinguality (Table 1 and narrative) but omits discussion and comparisons to MultiJail [1], a multilingual safety benchmark covering 10 languages. This omission weakens the novelty and positioning.
- Although culturally grounded prompts

Reviewer 03 · Rating 4 · Confidence 3

Strengths

1. The paper introduces a novel diagnostic benchmark that focuses on identifying how and why LLMs fail on safety dimensions, rather than just ranking models.
2. The study offers practical insights showing that prompt style, user confidence, and system brevity instructions directly influence factual reliability.
3. It is methodologically comprehensive and reproducible, releasing data and code across hallucination, bias, and harmfulness evaluations.

Weaknesses

1. **Lack of Coherent Design Across the Three Dimensions** The primary concern is that the benchmark’s three dimensions (bias & fairness, hallucination, harmfulness) appear to be designed and evaluated independently, without a unifying methodological framework. As a result, the benchmark reads as a concatenation of three separate datasets rather than a single, well-integrated evaluation suite. A clearer central design philosophy or shared task structure would strengthen the contribution.

Code & Models

Repositories

giskard-ai/phare
Official

Datasets

giskardai/phare
dataset · 451 dl

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Ethics and Social Impacts of AI