Can LLMs Evaluate Complex Attribution in QA? Automatic Benchmarking using Knowledge Graphs

Nan Hu; Jiaoyan Chen; Yike Wu; Guilin Qi; Hongru Wang; Sheng Bi; Yongrui Chen; Tongtong Wu; Jeff Z. Pan

arXiv:2401.14640·cs.CL·July 2, 2025·1 cites

TL;DR

This paper introduces CAQA, a large-scale benchmark for evaluating attributed question answering, utilizing knowledge graphs to generate complex attribution scenarios, and assesses various evaluators including LLMs against human judgments.

Contribution

The paper presents CAQA, a comprehensive benchmark with automatic attribution categories and complex scenarios, enabling systematic evaluation of attribution methods in QA.

Findings

01 CAQA effectively differentiates evaluator performance.

02 LLM evaluators show promising alignment with human judgments.

03 Benchmark results highlight strengths and weaknesses of current evaluators.
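One common way to quantify alignment between an automatic evaluator and human annotators is chance-corrected agreement. The sketch below computes Cohen's kappa over two label sequences. It is an illustration only: the summary above does not say which agreement measure the paper uses, and the function name and the label strings are hypothetical.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two equally long label sequences."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equally long")
    n = len(labels_a)
    # Fraction of items on which the two raters give the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected if both raters labelled independently at their own rates.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels, not taken from the benchmark.
human = ["supportive", "contradictory", "irrelevant", "supportive"]
model = ["supportive", "contradictory", "supportive", "supportive"]
kappa = cohens_kappa(human, model)
print(f"kappa = {kappa:.3f}")
```

Kappa is 1 for perfect agreement and near 0 when agreement is no better than chance, which makes it less sensitive to a skewed label distribution than raw accuracy.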

Abstract

Attributed Question Answering (AQA) has attracted wide attention, but there are still several limitations in evaluating the attributions, including lacking fine-grained attribution categories, relying on manual annotations, and failing to compare attributions with only subtle differences. To bridge these gaps, we introduce Complex Attributed Question Answering (CAQA), a large-scale benchmark containing comprehensive attribution categories, automatically generated using Knowledge Graphs (KGs), and complex attribution scenarios. We have conducted extensive experiments to verify the effectiveness of CAQA, including the benchmarking of 25 automatic evaluators, their comparison with human evaluators, the testing of LLM evaluators fine-tuned by CAQA and so on. These experiments also lead to a series of important findings that can benefit the future research of AQA. All the codes and data are…
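Benchmarking an automatic evaluator on fine-grained attribution categories amounts to comparing its predicted category for each (question, answer, evidence) item against the benchmark label. A minimal sketch of per-category and macro-averaged F1 follows. The category names, the function name, and the choice of F1 are assumptions for illustration and are not confirmed by the text above.

```python
def per_category_f1(gold, predicted):
    """Return a dict mapping each category to its F1 score."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    pairs = list(zip(gold, predicted))
    scores = {}
    for category in sorted(set(gold) | set(predicted)):
        tp = sum(g == category and p == category for g, p in pairs)
        fp = sum(g != category and p == category for g, p in pairs)
        fn = sum(g == category and p != category for g, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the category never occurs.
        scores[category] = 2 * tp / denom if denom else 0.0
    return scores


# Hypothetical category names and labels, used only to exercise the function.
gold = ["supportive", "partially_supportive", "contradictory",
        "irrelevant", "supportive", "contradictory"]
pred = ["supportive", "supportive", "contradictory",
        "irrelevant", "supportive", "irrelevant"]

scores = per_category_f1(gold, pred)
macro_f1 = sum(scores.values()) / len(scores)
for category, f1 in scores.items():
    print(f"{category}: {f1:.3f}")
print(f"macro-F1: {macro_f1:.3f}")
```

Reporting scores per category rather than a single accuracy figure shows which kinds of attribution error an evaluator tends to miss.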

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01 · Rating 6 · Confidence 5

Strengths

- CAQA uses KGs to generate complex QA benchmarks automatically, enabling scalability and minimizing manual annotation effort.
- Different reasoning complexities are considered, highlighting LLMs' capabilities in handling logical relationships between facts.
- The benchmark includes fine-grained attribution categories.

Weaknesses

- The task setting seems very similar to NLI; more discussion of the difference is needed.
- A few details about the human annotation process are missing.
- The distribution of the complexity is biased.

Reviewer 02 · Rating 6 · Confidence 4

Strengths

1. This paper introduces CAQA, a large-scale benchmark for evaluating complex attributions in QA.
2. The CAQA dataset contains various new definitions (e.g., fine-grained attribute categories and attribution complexities), and the data construction process is automatic, considerate, and comprehensive.
3. This paper contains comprehensive experiments. In addition to model performance on CAQA, it also includes fine-grained analysis, human consistency, and out-of-distribution data.

Weaknesses

1. This paper only considers GPT-3.5 and GPT-4 as closed-source LLMs, and some of the open-source LLMs used may be outdated (e.g., Mistral-7B has since gone through several versions). Adding more diverse and more recent models to the experiments would strengthen the contribution and help to discover which LLMs perform best on this challenging task.
2. There is a lack of comparisons with human performance on (a subset of) the dataset, which would better illustrate the performance gap and the challenge of the dataset.

Reviewer 03 · Rating 5 · Confidence 4

Strengths

The dataset is relevant for the important topic of answers with attributions from LLMs. Being able to carefully validate whether an answer actually follows from the sources is an important skill, and this dataset aims at helping with this. The paper is well written, clearly describing the approach. The use of the KG to create various incorrect attributions, together with using an LLM to rewrite them as text, seems quite effective. The paper provides access to the full dataset for exploration which is

Weaknesses

While breaking down the non-supportive cases into three subcategories can be helpful for understanding limitations, the boundary between them can be quite unclear. Also, the prompt for the non-GPT models doesn't go into great detail (beyond some examples) on what each category means. For instance, the "contradictory" evidence often states facts that are actually true, so it is not really a contradiction, just the "wrong" evidence. E.g., the answer "The person who founded the United States Coast Guar

Videos

Can LLMs Evaluate Complex Attribution in QA? Automatic Benchmarking using Knowledge Graphs

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Advanced Graph Neural Networks

Methods: Sparse Evolutionary Training