Argument-Based Comparative Question Answering Evaluation Benchmark

Irina Nikishina; Saba Anwar; Nikolay Dolgov; Maria Manina; Daria Ignatenko; Viktor Moskvoretskii; Artem Shelmanov; Tim Baldwin; Chris Biemann

arXiv:2502.14476·cs.CL·February 21, 2025

Open Access

TL;DR

This paper introduces an evaluation framework for assessing the quality of comparative question answering, comparing manual annotations and large language models across multiple datasets and criteria.

Contribution

It proposes an evaluation framework with 15 criteria for assessing comparative answers, and benchmarks several LLMs against manual annotation on two comparative question answering datasets.

Findings

01

Llama-3 70B Instruct performs best in summary evaluation.

02

GPT-4 excels in answering comparative questions.

03

Evaluation data and code are publicly available.

Abstract

In this paper, we aim to solve the problems standing in the way of automatic comparative question answering. To this end, we propose an evaluation framework to assess the quality of comparative question answering summaries. We formulate 15 criteria for assessing comparative answers created using manual annotation and annotation from 6 large language models and two comparative question answering datasets. We perform our tests using several LLMs and manual annotation under different settings and demonstrate the consistency of both evaluations. Our results show that the Llama-3 70B Instruct model performs best for summary evaluation, while GPT-4 is the best for answering comparative questions. All used data, code, and evaluation results are publicly available at https://anonymous.4open.science/r/cqa-evaluation-benchmark-4561/README.md.

Taxonomy

Topics: Expert finding and Q&A systems · Topic Modeling · Speech and dialogue systems

Methods: Attention Is All You Need · Absolute Position Encodings · Linear Layer · Byte Pair Encoding · Layer Normalization · Residual Connection · Dense Connections · Label Smoothing · Multi-Head Attention · Position-Wise Feed-Forward Layer