FREAK: A Fine-grained Hallucination Evaluation Benchmark for Advanced MLLMs
Zhihan Yin, Jianxin Liang, Yueqian Wang, Yifeng Yao, Huishuai Zhang, Dongyan Zhao

TL;DR
FREAK is a new benchmark for detailed, fine-grained evaluation of hallucinations in multimodal large language models, revealing significant hallucination issues and providing insights into model reasoning and perception.
Contribution
This paper introduces FREAK, a comprehensive and high-quality multimodal benchmark specifically designed to assess fine-grained hallucinations in MLLMs, addressing limitations of previous benchmarks.
Findings
State-of-the-art models exhibit severe hallucination issues in detailed visual perception.
A curated subset enables indirect evaluation of models' perception of detailed information.
Analysis of Chain-of-Thought prompting reveals patterns in hallucination and reasoning.
Abstract
Multimodal Large Language Models (MLLMs) suffer from hallucinations. Existing hallucination evaluation benchmarks are often limited by over-simplified tasks leading to saturated metrics, or insufficient diversity that fails to adequately assess the hallucination extent in state-of-the-art multimodal models. To address this gap, we propose FREAK, a comprehensive multimodal benchmark designed for fine-grained hallucination assessment in MLLMs. Through high-quality photorealistic images featuring fine-grained counter-commonsense edits, FREAK innovatively evaluates hallucination phenomena in detailed visual perception of MLLMs. Extensive experiments on FREAK show severe hallucination issues in SOTA models regarding detailed visual perception. To enable deeper investigation, we curate a controlled subset to indirectly evaluate the model's ability to perceive target detailed information.…
Peer Reviews
Decision: ICLR 2026 Poster
* The paper introduces FREAK, a large-scale, AI-generated CCS benchmark that evaluates models across six tasks (Detection, Counting, Attribute, Analysis, Position, OCR), offering a comprehensive assessment of fine-grained hallucination.
* Experiments show that most SOTA models perform far below human baseline on FREAK (≈45% vs. 86.7%), indicating that the benchmark meaningfully exposes robustness gaps under CCS conditions.
* The paper is readable and well-structured, making the motivation, pip
* Novelty concerns: CCS-style evaluation has appeared in prior benchmarks (e.g., PhD), and the six task types are common in MLLM evaluation, which may narrow the incremental contribution despite FREAK’s fine-grained focus.
* Although FREAK emphasizes photorealistic images with localized CCS edits, real-world usage spans paintings, animations, and other styles; it’s unclear how well the benchmark reflects such domains or whether deduplication alone guarantees sufficient diversity.
* The human b
* This paper highlights the limitations of previous MLLM hallucination benchmarks, such as low quality, oversimplified tasks, and insufficient diversity. The authors clearly compare their proposed benchmark, FREAK, with existing CCS benchmarks and demonstrate its superiority.
* The finding that Chain-of-Thought (CoT) reasoning negatively impacts performance on fine-grained hallucination tasks is insightful and well explained.
* The experiments are extensive, and the analyses are th
Major Weaknesses
* **Dependence on the Image Editing Model:** The quality of the generated CCS images likely depends heavily on the performance of the image editing model. Since this model is not trained specifically on CCS images, it may struggle to edit them properly, potentially leading to corrupted or unrealistic results. It is unclear how the authors ensured that the edited images appear natural. An ablation study or analysis regarding the influence of the image editing model on the benchm
1. The benchmark dataset is developed with strong motivation and should be beneficial for diagnosing models.
2. The blind-test human experiments for validation make the dataset more solid.
3. The in-depth analysis in Sec. 6 provides insights into why the model hallucinates, potentially inspiring better model development.
1. A major drawback of this work is that the taxonomy is neither exhaustive nor mutually exclusive, nor is it hierarchical. For instance, in Fig. 2, the example question for the “Analysis” dimension also seems to fit into the “Position” dimension. Besides, while spatial relations are included, questions that probe general relations between objects seem to be missing (e.g. the “riding” relation as in “a person riding a horse”).
2. In Sec. 4.2, why are errors divided into commonsense errors and others? Meanwhile, using
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Face Recognition and Perception · Hallucinations in medical conditions
