ROSCOE: A Suite of Metrics for Scoring Step-by-Step Reasoning

Olga Golovneva; Moya Chen; Spencer Poff; Martin Corredor; Luke Zettlemoyer; Maryam Fazel-Zarandi; Asli Celikyilmaz

arXiv:2212.07919 · cs.CL · September 13, 2023 · 28 cites

TL;DR

ROSCOE is a suite of interpretable, unsupervised metrics that automatically evaluate the correctness and quality of step-by-step reasoning generated by large language models, independent of the final answer.

Contribution

The paper presents ROSCOE, a set of metrics that score reasoning steps for semantic consistency, logicality, and factuality, traits that earlier text generation evaluation metrics were not designed to measure.
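
As a rough illustration of what an unsupervised, step-level score can look like, the sketch below aligns each reasoning step to its most similar context sentence and averages the result over the chain. It is a minimal sketch under stated assumptions, not ROSCOE's implementation: the function names (`embed`, `step_alignment`, `chain_score`) are hypothetical, and a toy bag-of-words vector stands in for a learned sentence embedding.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words vector. ASSUMPTION: a stand-in for a learned sentence embedding."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def step_alignment(step, context_sentences):
    """Map the best cosine match for one step onto a 0..1 scale via (1 + cos) / 2."""
    step_vec = embed(step)
    best = max((cosine(step_vec, embed(s)) for s in context_sentences), default=0.0)
    return (1.0 + best) / 2.0


def chain_score(steps, context_sentences):
    """Average step alignment over the whole reasoning chain (hypothetical score)."""
    if not steps:
        return 0.0
    return sum(step_alignment(s, context_sentences) for s in steps) / len(steps)


context = ["tom has 3 apples", "he buys 2 more apples"]
grounded = ["tom has 3 apples", "he buys 2 more apples"]
ungrounded = ["the moon is cheese"]

print(chain_score(grounded, context))
print(chain_score(ungrounded, context))
```

Because the toy vectors hold non-negative counts, scores fall between 0.5 and 1.0 here; a chain whose steps restate the context scores near the top of that range, while a chain with no lexical overlap sits at the bottom. No labels or reference chains are needed, which is the sense in which such a score is unsupervised.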

Findings

01. ROSCOE outperforms baseline metrics on multiple datasets.

02. It effectively measures reasoning traits like logicality and factuality.

03. ROSCOE is applicable across diverse reasoning tasks.

Abstract

Large language models show improved downstream task performance when prompted to generate step-by-step reasoning to justify their final answers. These reasoning steps greatly improve model interpretability and verification, but objectively studying their correctness (independent of the final answer) is difficult without reliable methods for automatic evaluation. We simply do not know how often the stated reasoning steps actually support the final end task predictions. In this work, we present ROSCOE, a suite of interpretable, unsupervised automatic scores that improve and extend previous text generation evaluation metrics. To evaluate ROSCOE against baseline metrics, we design a typology of reasoning errors and collect synthetic and human evaluation scores on commonly used reasoning datasets. In contrast with existing metrics, ROSCOE can measure semantic consistency, logicality,…

Code & Models

Repositories

facebookresearch/ParlAI (PyTorch, official)

Videos

ROSCOE: A Suite of Metrics for Scoring Step-by-Step Reasoning · SlidesLive

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification