MAD-Fact: A Multi-Agent Debate Framework for Long-Form Factuality Evaluation in LLMs
Yucheng Ning, Xixun Lin, Fang Fang, Yanan Cao

TL;DR
This paper introduces MAD-Fact, a multi-agent debate framework for evaluating the factual accuracy of long-form outputs from large language models, motivated by high-risk domains such as biomedicine, law, and education.
Contribution
The paper presents MAD-Fact, a debate-based multi-agent verification system; LongHalluQA, a Chinese long-form factuality dataset; and a fact importance hierarchy that weights claims by their significance when scoring long-form LLM outputs.
Findings
Larger LLMs tend to have higher factual consistency.
Domestic (Chinese-developed) models perform better on Chinese long-form content.
The framework effectively identifies factual inaccuracies in long-form texts.
Abstract
The widespread adoption of Large Language Models (LLMs) raises critical concerns about the factual accuracy of their outputs, especially in high-risk domains such as biomedicine, law, and education. Existing evaluation methods for short texts often fail on long-form content due to complex reasoning chains, intertwined perspectives, and cumulative information. To address this, we propose a systematic approach integrating large-scale long-form datasets, multi-agent verification mechanisms, and weighted evaluation metrics. We construct LongHalluQA, a Chinese long-form factuality dataset, and develop MAD-Fact, a debate-based multi-agent verification system. We introduce a fact importance hierarchy to capture the varying significance of claims in long-form texts. Experiments on two benchmarks show that larger LLMs generally maintain higher factual consistency, while domestic models excel on…
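
To make the idea of combining multi-agent verdicts with importance-weighted scoring concrete, the following is a minimal sketch, not the paper's actual metric or debate protocol. The `Claim` class, the `importance` weights, the per-agent boolean `votes`, and the strict-majority rule are all assumptions introduced here for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Claim:
    text: str
    importance: float  # hypothetical weight: higher means more central to the answer
    votes: List[bool]  # hypothetical per-agent verdicts (True = judged factual)


def majority_verdict(votes: List[bool]) -> bool:
    """Return True only if a strict majority of agents judged the claim factual."""
    return sum(votes) * 2 > len(votes)


def weighted_factuality(claims: List[Claim]) -> float:
    """Importance-weighted share of claims that the agents accept as factual."""
    total = sum(c.importance for c in claims)
    if total == 0:
        return 0.0  # no weighted claims to score
    supported = sum(c.importance for c in claims if majority_verdict(c.votes))
    return supported / total


claims = [
    Claim("central claim", importance=3.0, votes=[True, True, False]),
    Claim("minor detail", importance=1.0, votes=[False, False, True]),
]
score = weighted_factuality(claims)
```

Under these assumptions, an error in a high-importance claim lowers the score more than an error in a peripheral detail, which is the intuition behind weighting claims by significance rather than counting them equally.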
