Assessing the Sensitivity and Alignment of FOL Closeness Metrics

Ramya Keerthy Thatikonda; Wray Buntine; Ehsan Shareghi

arXiv:2501.08613·cs.CL·September 8, 2025

TL;DR

This paper evaluates the effectiveness of various metrics in assessing the correctness and similarity of First-Order Logic statements generated by language models, highlighting their sensitivities and alignment with LLM judgments.

Contribution

It provides a comprehensive analysis of existing FOL similarity metrics, revealing their sensitivities and proposing combined metrics for improved robustness and alignment with LLM evaluations.

Findings

01

BLEU is oversensitive to text perturbations

02

Smatch++ responds to structural operator changes

03

BertScore aligns more closely with LLM judgments

Abstract

The recent successful paradigm of solving logical reasoning problems with tool-augmented large language models (LLMs) leverages translation of natural language (NL) statements into First-Order Logic (FOL) and external theorem provers. However, the correctness of FOL statements, comprising operators and text, often goes unverified due to the lack of a reliable evaluation metric for comparing generated and ground-truth FOLs. In this paper, we conduct a comprehensive study on the sensitivity of existing NL-, FOL-, and graph-based metrics to capture differences between a sampled FOL and its corresponding ground-truth. We then measure the alignment between a metric-based ranking of FOL outputs and a strong LLM-as-a-judge. To do this, we first apply operator and text-based perturbations to ground-truth FOL statements to assess metric sensitivity. We then evaluate metric robustness by comparing…
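The perturbation-and-score setup described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the example formula, the helper names (`tokenize`, `swap_operator`, `rename_predicate`, `ngram_precision`), and the scoring function, which is a plain clipped n-gram precision (the building block of BLEU) rather than BLEU itself or any metric implementation used in the paper.

```python
import re
from collections import Counter


def tokenize(fol: str) -> list[str]:
    """Split a FOL string into identifiers and single symbols (hypothetical tokenizer)."""
    return re.findall(r"\w+|[^\w\s]", fol)


def swap_operator(fol: str, old: str = "∧", new: str = "∨") -> str:
    """Operator perturbation: replace the first occurrence of one connective."""
    return fol.replace(old, new, 1)


def rename_predicate(fol: str, old: str, new: str) -> str:
    """Text perturbation: rename a predicate wherever it appears as a whole word."""
    return re.sub(rf"\b{re.escape(old)}\b", new, fol)


def ngram_precision(candidate: list[str], reference: list[str], n: int) -> float:
    """Clipped n-gram precision of candidate against reference (not full BLEU)."""
    cand = Counter(zip(*[candidate[i:] for i in range(n)]))
    ref = Counter(zip(*[reference[i:] for i in range(n)]))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram counts at most as often as it occurs in the reference.
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    return overlap / total


# Illustrative ground-truth formula; not taken from the paper's data.
ground_truth = "∀x (Dog(x) ∧ Barks(x) → Loud(x))"
op_perturbed = swap_operator(ground_truth)
text_perturbed = rename_predicate(ground_truth, "Barks", "Howls")

ref_tokens = tokenize(ground_truth)
for name, variant in [("operator", op_perturbed), ("text", text_perturbed)]:
    tokens = tokenize(variant)
    p1 = ngram_precision(tokens, ref_tokens, 1)
    p2 = ngram_precision(tokens, ref_tokens, 2)
    print(f"{name:8s} unigram={p1:.3f} bigram={p2:.3f}  {variant}")
```

In this toy case each perturbation changes exactly one token, so a purely surface-level overlap score cannot distinguish a change of logical connective from a change of predicate name. This is the kind of behavior a sensitivity study of this sort is designed to expose.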

Code & Models

Repositories

ramyakeerthy/alignmentfol (official)
