Don't Trust: Verify -- Grounding LLM Quantitative Reasoning with Autoformalization

Jin Peng Zhou; Charles Staats; Wenda Li; Christian Szegedy; Kilian Q. Weinberger; Yuhuai Wu

arXiv:2403.18120·cs.AI·March 28, 2024·3 cites

TL;DR

This paper improves the accuracy of large language models on mathematical problems by autoformalizing their solutions into formal Isabelle code, checking that code automatically, and rejecting solutions that fail verification.

Contribution

The paper proposes autoformalizing LLM outputs into a formal theorem proving environment (Isabelle) so that incorrect solutions can be detected and rejected automatically, which makes the selected answers more reliable.

Findings

01

Improves GSM8K accuracy by more than 12% over vanilla majority voting by using autoformalization to filter candidate solutions.

02

Yields consistent gains across the GSM8K, MATH and MultiArith datasets and across LLM model sizes.

03

Provides a verification mechanism to automatically reject inconsistent solutions.

Abstract

Large language models (LLMs), such as Google's Minerva and OpenAI's GPT families, are becoming increasingly capable of solving mathematical quantitative reasoning problems. However, they still make unjustified logical and computational errors in their reasoning steps and answers. In this paper, we leverage the fact that if the training corpus of LLMs contained sufficiently many examples of formal mathematics (e.g., in Isabelle, a formal theorem proving environment), they can be prompted to translate, i.e., autoformalize, informal mathematical statements into formal Isabelle code, which can be verified automatically for internal consistency. This provides a mechanism to automatically reject solutions whose formalized versions are inconsistent within themselves or with the formalized problem statement. We evaluate our method on GSM8K, MATH and MultiArith datasets and demonstrate that our…
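The rejection mechanism described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `verified_vote` and `toy_arithmetic_verifier` are hypothetical, the toy verifier only re-checks simple arithmetic claims of the form `a op b = c` and stands in for autoformalization plus an Isabelle check, and the fallback to a plain majority vote when no sample verifies is an assumption made for this example.

```python
import re
from collections import Counter
from typing import Callable, List, Optional, Tuple

# A sample is (final_answer, solution_text), e.g. one LLM generation.
Sample = Tuple[str, str]


def majority_vote(answers: List[str]) -> Optional[str]:
    """Return the most frequent answer, or None if there are no answers."""
    if not answers:
        return None
    return Counter(answers).most_common(1)[0][0]


def verified_vote(
    samples: List[Sample],
    verifier: Callable[[str], bool],
) -> Optional[str]:
    """Majority vote restricted to samples whose solution passes the verifier.

    Assumption for this sketch: if no sample verifies, fall back to a
    plain majority vote over all samples.
    """
    verified = [answer for answer, solution in samples if verifier(solution)]
    if verified:
        return majority_vote(verified)
    return majority_vote([answer for answer, _ in samples])


# Hypothetical stand-in for "autoformalize, then check in Isabelle":
# it only re-computes integer claims written as "a + b = c", "a - b = c",
# or "a * b = c" inside the solution text.
_CLAIM = re.compile(r"(\d+)\s*([+\-*])\s*(\d+)\s*=\s*(\d+)")
_OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
}


def toy_arithmetic_verifier(solution: str) -> bool:
    """Accept a solution only if it has at least one claim and all claims hold."""
    claims = _CLAIM.findall(solution)
    if not claims:
        return False
    return all(_OPS[op](int(a), int(b)) == int(c) for a, op, b, c in claims)


if __name__ == "__main__":
    samples = [
        ("18", "Each of 3 boxes holds 5 pens, so 3 * 5 = 18."),
        ("18", "There are 3 * 5 = 18 pens."),
        ("15", "There are 3 * 5 = 15 pens."),
    ]
    print("plain majority:", majority_vote([a for a, _ in samples]))
    print("verified vote: ", verified_vote(samples, toy_arithmetic_verifier))
```

In this toy run the wrong answer wins the plain vote because two of three samples repeat the same arithmetic slip, while the verified vote keeps only the sample whose claim checks out. A real verifier would need to check the formalized solution against the formalized problem statement as well, which this stand-in does not attempt.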

Code & Models

Repositories

jinpz/dtv (Official)

Videos

Don't Trust: Verify -- Grounding LLM Quantitative Reasoning with Autoformalization · SlidesLive

Taxonomy

Topics: Artificial Intelligence in Law

Methods: Attention Is All You Need · Linear Layer · Attention Dropout · Residual Connection · Cosine Annealing · Multi-Head Attention · Linear Warmup With Cosine Annealing · Discriminative Fine-Tuning · Softmax