An Analysis of Multilingual FActScore
Kim Trong Vu, Michael Krumdick, Varshini Reddy, Franck Dernoncourt, Viet Dac Lai

TL;DR
This paper investigates the limitations of the FActScore metric in multilingual contexts, introduces a new multilingual dataset, and proposes mitigations to improve factuality estimation across diverse languages.
Contribution
It provides the first comprehensive analysis of FActScore in multiple languages, highlighting challenges and proposing solutions to enhance its reliability.
Findings
LLMs show inconsistent FActScore across languages.
Knowledge source quality significantly impacts FActScore accuracy.
Mitigations improve FActScore reliability in low-resource languages.
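For orientation, FActScore is usually described as the fraction of atomic facts in a generated text that a knowledge source supports. The sketch below illustrates only that final ratio. It is not the authors' pipeline: the names `factscore`, `toy_scorer`, and `KNOWLEDGE` are hypothetical, the exact-match lookup stands in for an LLM-based fact scorer, and the atomic facts are assumed to have been extracted already.

```python
from typing import Callable, Iterable


def factscore(atomic_facts: Iterable[str],
              is_supported: Callable[[str], bool]) -> float:
    """Return the fraction of atomic facts judged as supported.

    Returning 0.0 for an empty fact list is a convention assumed here.
    """
    facts = list(atomic_facts)
    if not facts:
        return 0.0
    supported = sum(1 for fact in facts if is_supported(fact))
    return supported / len(facts)


# Hypothetical toy knowledge source: a set of exact statements.
KNOWLEDGE = {
    "Hanoi is the capital of Vietnam.",
    "Vietnam is in Southeast Asia.",
}


def toy_scorer(fact: str) -> bool:
    # Stand-in for a real fact scorer: exact-match lookup only.
    return fact in KNOWLEDGE


facts = [
    "Hanoi is the capital of Vietnam.",
    "Vietnam is in Southeast Asia.",
    "Hanoi has a population of 50 million.",
]
score = factscore(facts, toy_scorer)
print(f"{score:.2f}")
```

In this sketch, any fact absent from `KNOWLEDGE` counts as unsupported whether or not it is true, which is the kind of knowledge-source coverage effect the findings point to.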
Abstract
FActScore has gained popularity as a metric to estimate the factuality of long-form texts generated by Large Language Models (LLMs) in English. However, no prior work has studied the behavior of FActScore in other languages. This paper studies the limitations of each component in the four-component pipeline of FActScore in the multilingual setting. We introduce a new dataset for FActScore on texts generated by strong multilingual LLMs. Our evaluation shows that LLMs exhibit distinct behaviors in both fact extraction and fact scoring tasks. No LLM produces consistent and reliable FActScore across languages with varying levels of resources. We also find that the knowledge source plays an important role in the quality of the estimated FActScore. Using Wikipedia as the knowledge source may hinder the true FActScore of long-form text due to its limited coverage in medium- and…
Taxonomy
Topics: Simulation and Modeling Applications
