Towards Faithful and Robust LLM Specialists for Evidence-Based Question-Answering
Tobias Schimanski, Jingwei Ni, Mathias Kraus, Elliott Ash, Markus Leippold

TL;DR
This paper develops a systematic approach to improve the faithfulness and robustness of Large Language Models in Evidence-Based Question-Answering by fine-tuning with high-quality synthetic data and benchmarking their performance.
Contribution
It introduces a novel data generation pipeline with quality filters and four benchmark test sets for evaluating LLMs in Evidence-Based QA.
Findings
Fine-tuning with synthetic high-quality data enhances model performance
Data quality has a greater impact than data quantity on model accuracy
Models show improved robustness on both in- and out-of-distribution data
Abstract
Advances towards more faithful and traceable answers of Large Language Models (LLMs) are crucial for various research and practical endeavors. One avenue in reaching this goal is basing the answers on reliable sources. However, this Evidence-Based QA has proven to work insufficiently with LLMs in terms of citing the correct sources (source quality) and truthfully representing the information within sources (answer attributability). In this work, we systematically investigate how to robustly fine-tune LLMs for better source quality and answer attributability. Specifically, we introduce a data generation pipeline with automated data quality filters, which can synthesize diversified high-quality training and testing data at scale. We further introduce four test sets to benchmark the robustness of fine-tuned specialist models. Extensive evaluation shows that fine-tuning on synthetic data…
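To make the idea of an automated data quality filter concrete, here is a minimal sketch of what a source-quality check on a synthetic sample could look like. It is not the authors' pipeline: the `Sample` structure, the bracketed `[n]` citation format, and both filter functions are assumptions made for illustration. Checking answer attributability (whether a sentence is actually supported by the source it cites) would additionally need an entailment-style model and is not covered here.

```python
import re
from dataclasses import dataclass

# Assumed citation format: inline numeric markers such as "[1]".
CITATION = re.compile(r"\[(\d+)\]")


@dataclass
class Sample:
    """Hypothetical synthetic training sample for Evidence-Based QA."""
    question: str
    sources: dict[int, str]   # source id -> source text
    relevant_ids: set[int]    # ids of sources that bear on the question
    answer: str               # answer text with inline "[n]" citations


def cited_ids(answer: str) -> set[int]:
    """Collect every source id cited in the answer."""
    return {int(m) for m in CITATION.findall(answer)}


def passes_source_filter(sample: Sample) -> bool:
    """Keep a sample only if it cites something and cites only relevant sources."""
    cited = cited_ids(sample.answer)
    return bool(cited) and cited <= sample.relevant_ids


def every_sentence_cited(answer: str) -> bool:
    """Require a citation in each sentence (assumes the marker precedes the period)."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", answer.strip()) if s]
    return all(CITATION.search(s) for s in sentences)


def keep_sample(sample: Sample) -> bool:
    """Combine both checks into a single keep/drop decision."""
    return passes_source_filter(sample) and every_sentence_cited(sample.answer)
```

A filter of this kind drops samples rather than repairing them, which fits the reported finding that data quality matters more than data quantity.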
Taxonomy
Topics: Expert finding and Q&A systems · Topic Modeling
