Assessing the Sensitivity and Alignment of FOL Closeness Metrics
Ramya Keerthy Thatikonda, Wray Buntine, Ehsan Shareghi

TL;DR
This paper evaluates the effectiveness of various metrics in assessing the correctness and similarity of First-Order Logic statements generated by language models, highlighting their sensitivities and alignment with LLM judgments.
Contribution
It provides a comprehensive analysis of existing FOL similarity metrics, revealing their sensitivities and proposing combined metrics for improved robustness and alignment with LLM evaluations.
Findings
BLEU is oversensitive to text perturbations
Smatch++ responds to structural operator changes
BERTScore aligns more closely with LLM judgments
Abstract
The recent successful paradigm of solving logical reasoning problems with tool-augmented large language models (LLMs) leverages translation of natural language (NL) statements into First-Order Logic (FOL) and external theorem provers. However, the correctness of FOL statements, comprising operators and text, often goes unverified due to the lack of a reliable evaluation metric for comparing generated and ground-truth FOLs. In this paper, we conduct a comprehensive study on the sensitivity of existing NL-, FOL-, and graph-based metrics to capture differences between a sampled FOL and its corresponding ground-truth. We then measure the alignment between a metric-based ranking of FOL outputs and a strong LLM-as-a-judge. To do this, we first apply operator and text-based perturbations to ground-truth FOL statements to assess metric sensitivity. We then evaluate metric robustness by comparing…
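To make the perturbation idea concrete, here is a minimal sketch of how an operator or text perturbation might be applied to a ground-truth FOL string and then scored. Everything in it is an assumption for illustration: the swap table `OPERATOR_SWAPS`, the helpers `tokenize`, `perturb_operator`, `perturb_text`, and `token_similarity` are hypothetical names, and the similarity score is a plain token-sequence ratio from Python's `difflib`, used only as a stand-in. It is not BLEU, Smatch++, or BERTScore, and it is not the paper's actual perturbation set.

```python
import random
import re
from difflib import SequenceMatcher

# Hypothetical swap table: each operator maps to a plausible "wrong" operator.
# The perturbation set used in the paper is not specified in this summary.
OPERATOR_SWAPS = {"∧": "∨", "∨": "∧", "∀": "∃", "∃": "∀", "→": "↔", "↔": "→"}


def tokenize(fol: str) -> list[str]:
    """Split a FOL string into identifiers and single-character symbols."""
    return re.findall(r"\w+|[^\w\s]", fol)


def perturb_operator(fol: str, rng: random.Random) -> str:
    """Swap one randomly chosen logical operator; return the input if none exist."""
    tokens = tokenize(fol)
    positions = [i for i, tok in enumerate(tokens) if tok in OPERATOR_SWAPS]
    if not positions:
        return fol
    i = rng.choice(positions)
    tokens[i] = OPERATOR_SWAPS[tokens[i]]
    return " ".join(tokens)


def perturb_text(fol: str, old: str, new: str) -> str:
    """Rename every occurrence of one identifier (e.g. a predicate name)."""
    return " ".join(new if tok == old else tok for tok in tokenize(fol))


def token_similarity(a: str, b: str) -> float:
    """Stand-in closeness score in [0, 1] over token sequences (not a real metric)."""
    return SequenceMatcher(None, tokenize(a), tokenize(b)).ratio()


if __name__ == "__main__":
    rng = random.Random(0)
    gold = "∀x (Dog(x) → Animal(x))"
    for label, variant in [
        ("operator", perturb_operator(gold, rng)),
        ("text", perturb_text(gold, "Dog", "Cat")),
    ]:
        print(label, variant, round(token_similarity(gold, variant), 3))
```

A surface-level score like this one treats a flipped quantifier and a renamed predicate as roughly equal single-token edits, even though the first changes the logical meaning far more than the second may. That gap between surface overlap and logical effect is the kind of behavior a sensitivity study of this sort is designed to expose.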
Taxonomy
Topics: Consumer Retail Behavior Studies · Aviation Industry Analysis and Trends · Customer churn and segmentation
