Argument-Based Comparative Question Answering Evaluation Benchmark
Irina Nikishina, Saba Anwar, Nikolay Dolgov, Maria Manina, Daria Ignatenko, Viktor Moskvoretskii, Artem Shelmanov, Tim Baldwin, Chris Biemann

TL;DR
This paper introduces an evaluation framework for assessing the quality of comparative question answering, comparing manual annotations and large language models across multiple datasets and criteria.
Contribution
It proposes a comprehensive evaluation framework with 15 criteria for assessing comparative answers, and benchmarks several LLMs and manual annotations on multiple datasets.
Findings
Llama-3 70B Instruct performs best in summary evaluation.
GPT-4 excels in answering comparative questions.
Evaluation data and code are publicly available.
Abstract
In this paper, we aim to solve the problems standing in the way of automatic comparative question answering. To this end, we propose an evaluation framework to assess the quality of comparative question answering summaries. We formulate 15 criteria for assessing comparative answers created using manual annotation and annotation from 6 large language models and two comparative question answering datasets. We perform our tests using several LLMs and manual annotation under different settings and demonstrate the consistency of both evaluations. Our results show that the Llama-3 70B Instruct model achieves the best results for summary evaluation, while GPT-4 is the best for answering comparative questions. All data, code, and evaluation results are publicly available at https://anonymous.4open.science/r/cqa-evaluation-benchmark-4561/README.md.
Taxonomy
Topics: Expert finding and Q&A systems · Topic Modeling · Speech and dialogue systems
Methods: Attention Is All You Need · Absolute Position Encodings · Linear Layer · Byte Pair Encoding · Layer Normalization · Residual Connection · Dense Connections · Label Smoothing · Multi-Head Attention · Position-Wise Feed-Forward Layer
