LongCat-Flash-Prover: Advancing Native Formal Reasoning via Agentic Tool-Integrated Reinforcement Learning

Jianing Wang; Jianfei Zhang; Qi Guo; Linsen Guo; Rumei Li; Chao Zhang; Chong Peng; Cunguang Wang; Dengchang Zhao; Jiarong Shi; Jingang Wang; Liulin Feng; Mengxia Shen; Qi Li; Shengnan An; Shun Wang; Wei Shi; Xiangyu Xi; Xiaoyu Li; Xuezhi Cao; Yi Lu; Yunke Zhao; Zhengyu Chen; Zhimin Lin; Wei Wang; Peng Pei; Xunliang Cai

arXiv:2603.21065·cs.AI·March 24, 2026

TL;DR

LongCat-Flash-Prover is a large-scale MoE model that significantly advances formal reasoning in Lean4 by integrating agentic tool-based reinforcement learning, achieving state-of-the-art results in theorem proving and auto-formalization tasks.

Contribution

The paper introduces a Hybrid-Experts Iteration Framework for expanding high-quality task trajectories and a Hierarchical Importance Sampling Policy Optimization (HisPO) algorithm for stabilizing the training of large MoE models on long-horizon formal reasoning tasks.

Findings

01. Achieves 97.1% pass rate on MiniF2F with minimal inference budget

02. Solves 70.8% of ProverBench problems efficiently

03. Outperforms existing open-weights models in formal reasoning benchmarks

Abstract

We introduce LongCat-Flash-Prover, a flagship 560-billion-parameter open-source Mixture-of-Experts (MoE) model that advances Native Formal Reasoning in Lean4 through agentic tool-integrated reasoning (TIR). We decompose the native formal reasoning task into three independent formal capabilities, i.e., auto-formalization, sketching, and proving. To facilitate these capabilities, we propose a Hybrid-Experts Iteration Framework to expand high-quality task trajectories, including generating a formal statement based on a given informal problem, producing a whole-proof directly from the statement, or a lemma-style sketch. During agentic RL, we present a Hierarchical Importance Sampling Policy Optimization (HisPO) algorithm, which aims to stabilize the MoE model training on such long-horizon tasks. It employs a gradient masking strategy that accounts for the policy staleness and the inherent…
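
To make the three capabilities named in the abstract concrete, here is a minimal Lean 4 sketch. The informal problem and the theorem names (`sq_le_sq_statement`, `sq_le_sq_sketch`, `sq_le_sq_whole`) are hypothetical illustrations chosen for this page; they are not taken from the paper or its training data, and the paper's actual output formats may differ.

```lean
-- Hypothetical informal problem:
--   "If a ≤ b for natural numbers a and b, then a * a ≤ b * b."

-- Auto-formalization: produce only the formal statement;
-- the proof is left open with `sorry`.
theorem sq_le_sq_statement (a b : Nat) (h : a ≤ b) : a * a ≤ b * b := by
  sorry

-- Sketching: a lemma-style outline. Each intermediate step is stated
-- as a `have` with its own proof deferred, and the final line combines them.
theorem sq_le_sq_sketch (a b : Nat) (h : a ≤ b) : a * a ≤ b * b := by
  have h1 : a * a ≤ a * b := by sorry
  have h2 : a * b ≤ b * b := by sorry
  exact Nat.le_trans h1 h2

-- Proving: a whole proof produced directly from the statement.
theorem sq_le_sq_whole (a b : Nat) (h : a ≤ b) : a * a ≤ b * b :=
  Nat.mul_le_mul h h
```

In a tool-integrated setting, each candidate would presumably be sent to the Lean compiler, whose feedback (errors, or remaining `sorry` goals) guides the next attempt; the exact loop used by the authors is not described on this page.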

Code & Models

Models

meituan-longcat/LongCat-Flash-Prover (Hugging Face model · 355 downloads · 27 likes)

Taxonomy

Topics: Mobile Crowdsensing and Crowdsourcing · Ethics and Social Impacts of AI · Reinforcement Learning in Robotics