ActuBench: A Multi-Agent LLM Pipeline for Generation and Evaluation of Actuarial Reasoning Tasks

Jan-Philipp Schmidt

arXiv:2604.20273·cs.AI·April 23, 2026

TL;DR

ActuBench is a multi-agent LLM pipeline that generates and evaluates actuarial assessment items, published with a browsable web interface and a benchmark of 50 language models.

Contribution

The paper presents a multi-agent pipeline for the automated creation and evaluation of actuarial assessment items, together with a web-based leaderboard and an evaluation of 50 language models from eight providers.

Findings

01 Verification flags most drafted items on first pass

02 Locally hosted open-weights inference offers cost-effective performance

03 MCQ and LLM-as-Judge rankings differ significantly
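
One common way to quantify how far two leaderboards disagree is a rank correlation. The sketch below computes Spearman's rho in plain Python. This page does not say which statistic the paper uses, so Spearman is an assumed stand-in, and the model names and scores are invented placeholders, not results from ActuBench.

```python
from typing import Dict, List


def average_ranks(scores: Dict[str, float]) -> Dict[str, float]:
    """Rank models by score (1 = best); tied scores share the average rank."""
    ordered: List[str] = sorted(scores, key=scores.get, reverse=True)
    ranks: Dict[str, float] = {}
    i = 0
    while i < len(ordered):
        j = i
        # Extend j over the run of models tied with ordered[i].
        while j + 1 < len(ordered) and scores[ordered[j + 1]] == scores[ordered[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[ordered[k]] = shared
        i = j + 1
    return ranks


def spearman(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Spearman's rho over the models present in both leaderboards.

    Requires at least two distinct scores in each leaderboard.
    """
    models = sorted(set(a) & set(b))
    ranks_a = average_ranks({m: a[m] for m in models})
    ranks_b = average_ranks({m: b[m] for m in models})
    mean = (len(models) + 1) / 2  # mean of average ranks is always (n + 1) / 2
    cov = sum((ranks_a[m] - mean) * (ranks_b[m] - mean) for m in models)
    var_a = sum((ranks_a[m] - mean) ** 2 for m in models)
    var_b = sum((ranks_b[m] - mean) ** 2 for m in models)
    return cov / (var_a * var_b) ** 0.5


# Placeholder scores for four hypothetical models (not data from the paper).
mcq_accuracy = {"model_a": 0.80, "model_b": 0.70, "model_c": 0.60, "model_d": 0.50}
judge_score = {"model_a": 0.55, "model_b": 0.75, "model_c": 0.65, "model_d": 0.45}

rho = spearman(mcq_accuracy, judge_score)
print(f"Spearman rho between the two rankings: {rho:.2f}")
```

A rho near 1 would mean the two benchmarks order models almost identically; values well below 1 indicate that the multiple-choice and judge-scored rankings reward different models.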

Abstract

We present ActuBench, a multi-agent LLM pipeline for the automated generation and evaluation of advanced actuarial assessment items aligned with the International Actuarial Association (IAA) Education Syllabus. The pipeline separates four LLM roles by adapter: one agent drafts items, one constructs distractors, a third independently verifies both stages and drives bounded one-shot repair loops, and a cost-optimized auxiliary agent handles Wikipedia-note summarization and topic labelling. The items, per-model responses and complete leaderboard are published as a browsable web interface at https://actubench.de/en/, allowing readers and practitioners to inspect individual items without a repository checkout. We evaluate 50 language models from eight providers on two complementary benchmarks (100 empirically hardest multiple-choice items and 100 open-ended items scored by an LLM judge) …
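
The role separation described in the abstract can be sketched as a small control loop. The following is a minimal illustration, not the author's implementation: the `Item` and `Verdict` types, the function signatures, and the stub agents are all hypothetical, and "bounded one-shot repair" is read here as "at most one repair attempt per stage". The auxiliary summarization and labelling agent is omitted.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Item:
    question: str
    answer: str
    distractors: List[str] = field(default_factory=list)


@dataclass
class Verdict:
    ok: bool
    feedback: str = ""


def run_stage(
    produce: Callable[[Optional[str]], Item],
    verify: Callable[[Item], Verdict],
    max_repairs: int = 1,  # assumption: one repair attempt per stage
) -> Optional[Item]:
    """Produce a candidate, verify it, and retry with feedback a bounded number of times."""
    feedback: Optional[str] = None
    for _ in range(max_repairs + 1):
        candidate = produce(feedback)
        verdict = verify(candidate)
        if verdict.ok:
            return candidate
        feedback = verdict.feedback
    return None  # still failing after the repair budget: discard the item


def generate_item(topic, drafter, distractor_writer, verifier) -> Optional[Item]:
    """Stage 1 drafts the item, stage 2 adds distractors; the verifier checks both."""
    item = run_stage(lambda fb: drafter(topic, fb), lambda it: verifier(it, "draft"))
    if item is None:
        return None

    def with_distractors(fb: Optional[str]) -> Item:
        return Item(item.question, item.answer, distractor_writer(item, fb))

    return run_stage(with_distractors, lambda it: verifier(it, "distractors"))


# Deterministic stubs standing in for the LLM-backed agents (hypothetical).
def stub_drafter(topic: str, feedback: Optional[str]) -> Item:
    # The first draft omits the answer; the repaired draft supplies it.
    return Item(question=f"Question on {topic}", answer="" if feedback is None else "42")


def stub_distractors(item: Item, feedback: Optional[str]) -> List[str]:
    return ["41", "43", "44"]


def stub_verifier(item: Item, stage: str) -> Verdict:
    if stage == "draft" and not item.answer:
        return Verdict(False, "missing answer")
    if stage == "distractors" and item.answer in item.distractors:
        return Verdict(False, "a distractor equals the key")
    return Verdict(True)


result = generate_item("survival models", stub_drafter, stub_distractors, stub_verifier)
print(result)
```

Keeping the verifier as a separate callable from the drafter and the distractor writer mirrors the abstract's point that verification is independent of both generation stages. In this sketch, an item that still fails after its repair budget is dropped rather than retried indefinitely.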

Code & Models

Repositories

https://actubench.de/en