Évaluation des capacités de réponse de larges modèles de langage (LLM) pour des questions d'historiens
Mathieu Chartier, Nabil Dakkoune, Guillaume Bourgeois, Stéphane Jean

TL;DR
This study evaluates the ability of various large language models to accurately and reliably answer history questions in French, revealing significant shortcomings in accuracy, language handling, and response consistency.
Contribution
It provides a systematic assessment of LLMs' performance on French history questions, highlighting their limitations in accuracy and language quality.
Findings
LLMs show overall insufficient accuracy in historical responses.
Responses exhibit uneven handling of the French language.
Responses are often verbose and inconsistent.
Abstract
Large Language Models (LLMs) such as ChatGPT or Bard have revolutionized information retrieval and captivated the public with their ability to generate tailored responses in record time, regardless of the topic. In this article, we assess the ability of various LLMs to produce reliable, comprehensive, and sufficiently relevant responses about historical facts in French. To this end, we constructed a testbed of numerous history-related questions of varying types, themes, and levels of difficulty. Our evaluation of responses from ten selected LLMs reveals numerous shortcomings in both substance and form. Beyond an overall insufficient accuracy rate, we highlight uneven treatment of the French language, as well as verbosity and inconsistency in the responses provided by LLMs.
Taxonomy
Topics: Natural Language Processing Techniques
