Phare: A Safety Probe for Large Language Models
Pierre Le Jeune, Benoît Malézieux, Weixuan Xiao, Matteo Dora

TL;DR
Phare is a comprehensive multilingual diagnostic framework designed to evaluate large language models across safety dimensions like hallucinations, biases, and harmful content, revealing systematic vulnerabilities and guiding improvements.
Contribution
It introduces a novel safety probing framework that identifies specific failure modes in LLMs, moving beyond performance metrics to enhance safety and trustworthiness.
Findings
Systematic vulnerabilities across safety dimensions
Patterns of biases and harmful content generation
Insights for improving LLM robustness and alignment
Abstract
Ensuring the safety of large language models (LLMs) is critical for responsible deployment, yet existing evaluations often prioritize performance over identifying failure modes. We introduce Phare, a multilingual diagnostic framework to probe and evaluate LLM behavior across three critical dimensions: hallucination and reliability, social biases, and harmful content generation. Our evaluation of 17 state-of-the-art LLMs reveals patterns of systematic vulnerabilities across all safety dimensions, including sycophancy, prompt sensitivity, and stereotype reproduction. By highlighting these specific failure modes rather than simply ranking models, Phare provides researchers and practitioners with actionable insights to build more robust, aligned, and trustworthy language systems.
Peer Reviews
Decision·ICLR 2026 Conference Withdrawn Submission
(1) Comprehensive and transparent framework. Phare integrates hallucination, bias, and harmful-content evaluations into a single, modular pipeline with open data, clearly defined prompts, and reproducible scoring methods. The framework is well-documented and replicable, setting a strong standard for benchmark transparency.
(2) Rigorous statistical testing and validation. The analysis incorporates confidence intervals, significance testing with FDR correction, and human verification of LLM-j
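For readers unfamiliar with the FDR correction mentioned in the review above, the following is a minimal sketch of one common procedure, Benjamini-Hochberg. The review does not say which FDR procedure the paper uses, so the choice of Benjamini-Hochberg, the function name `benjamini_hochberg`, and the example p-values are all assumptions made for illustration only.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans, True where the hypothesis is rejected.

    Illustrative Benjamini-Hochberg step-up procedure; not taken from the paper.
    """
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])

    # Find the largest 1-based rank k such that p_(k) <= (k / m) * alpha.
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_k = rank

    # Reject every hypothesis whose rank is at most max_k.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_k:
            rejected[idx] = True
    return rejected


# Made-up p-values, e.g. one per pairwise model comparison.
example_p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
example_decisions = benjamini_hochberg(example_p_values, alpha=0.05)
```

With these made-up values, only the two smallest p-values survive the correction, even though five of the eight fall below an uncorrected 0.05 threshold.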
(1) Language aggregation and performance variation. It is unclear whether headline results are averaged across English, French, and Spanish or reported for English only. Aggregation can obscure meaningful cross-lingual differences, and the appendix suggests noticeable performance variation across languages, with English often strongest and other languages sometimes reversing rankings. Clarifying how results are aggregated and briefly discussing these cross-lingual trends would make the finding
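The aggregation concern raised above can be illustrated with a small sketch. All scores below are invented for illustration and are not results from the paper; the model names and the helper `rank_models` are hypothetical, and the unweighted mean is only one possible aggregation rule.

```python
from statistics import mean

# Hypothetical per-language scores (higher is better); NOT from the paper.
scores = {
    "model_a": {"en": 0.90, "fr": 0.70, "es": 0.68},
    "model_b": {"en": 0.80, "fr": 0.76, "es": 0.75},
}


def rank_models(scores, language=None):
    """Rank models best-first.

    If `language` is given, rank by that language's score; otherwise rank
    by the unweighted mean over all languages.
    """
    def key(model):
        per_language = scores[model]
        if language is not None:
            return per_language[language]
        return mean(per_language.values())

    return sorted(scores, key=key, reverse=True)


english_only = rank_models(scores, "en")  # model_a ranks first
averaged = rank_models(scores)            # model_b ranks first
```

In this toy setup the English-only ranking and the averaged ranking disagree, which is why a benchmark report benefits from stating explicitly which of the two its headline numbers use.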
- The writing and figures are clear and easy to follow.
- Unlike many English-centric benchmarks (e.g., those derived primarily from Wikipedia), Phare intentionally draws from culture-specific sources to diversify prompts.
- Safety is assessed across multiple dimensions within one framework.
- Experiments center on English, French, and Spanish (three Indo-European languages), providing narrow evidence for “multilingual” claims. Typologically distant and low-resource languages are not represented, limiting generalizability.
- The manuscript emphasizes multilinguality (Table 1 and narrative) but omits discussion and comparisons to MultiJail [1], a multilingual safety benchmark covering 10 languages. This omission weakens the novelty and positioning.
- Although culturally grounded prompts
1. The paper introduces a novel diagnostic benchmark that focuses on identifying how and why LLMs fail on safety dimensions, rather than just ranking models.
2. The study offers practical insights showing that prompt style, user confidence, and system brevity instructions directly influence factual reliability.
3. It is methodologically comprehensive and reproducible, releasing data and code across hallucination, bias, and harmfulness evaluations.
1. **Lack of Coherent Design Across the Three Dimensions** The primary concern is that the benchmark’s three dimensions (bias & fairness, hallucination, harmfulness) appear to be designed and evaluated independently, without a unifying methodological framework. As a result, the benchmark feels more like a concatenation of three separate datasets than a single, well-integrated evaluation suite. A clearer central design philosophy or shared task structure would strengthen the contribution.
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Ethics and Social Impacts of AI
