TL;DR
ActuBench introduces a multi-agent LLM pipeline for generating and evaluating actuarial assessment items, with a web interface and comprehensive benchmarking of 50 models.
Contribution
The paper presents a novel multi-agent pipeline for automated actuarial item creation and evaluation, including a web-based leaderboard and extensive model benchmarking.
Findings
Verification flags most drafted items on first pass
Locally hosted open-weights inference offers cost-effective performance
MCQ and LLM-as-Judge rankings differ significantly
Abstract
We present ActuBench, a multi-agent LLM pipeline for the automated generation and evaluation of advanced actuarial assessment items aligned with the International Actuarial Association (IAA) Education Syllabus. The pipeline separates four LLM roles by adapter: one agent drafts items, one constructs distractors, a third independently verifies both stages and drives bounded one-shot repair loops, and a cost-optimized auxiliary agent handles Wikipedia-note summarization and topic labelling. The items, per-model responses and complete leaderboard are published as a browsable web interface at https://actubench.de/en/, allowing readers and practitioners to inspect individual items without a repository checkout. We evaluate 50 language models from eight providers on two complementary benchmarks -- 100 empirically hardest multiple-choice items and 100 open-ended items scored by an LLM judge --…
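The control flow the abstract describes (draft, verify, at most one repair, then distractors, verify, at most one repair) can be illustrated with a minimal sketch. This is not the authors' implementation: the `Item` fields, the function names, the stage labels, and the choice to model repair as a separate callable are all assumptions made for illustration, and the toy role functions stand in for LLM adapters.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Item:
    # Hypothetical item shape; the paper's actual schema is not given here.
    question: str
    answer: str
    distractors: List[str] = field(default_factory=list)


# Each role is a plain callable; in a real system each would wrap its own LLM adapter.
Drafter = Callable[[str], Item]
DistractorWriter = Callable[[Item], List[str]]
Verifier = Callable[[Item, str], Optional[str]]  # returns a complaint, or None if OK
Repairer = Callable[[Item, str, str], Item]      # (item, stage, complaint) -> new item


def verify_with_one_repair(item: Item, stage: str,
                           verify: Verifier, repair: Repairer) -> Optional[Item]:
    """Verify once; on failure allow exactly one repair, then re-verify."""
    complaint = verify(item, stage)
    if complaint is None:
        return item
    repaired = repair(item, stage, complaint)
    # Bounded loop: a second failure rejects the item instead of retrying.
    return repaired if verify(repaired, stage) is None else None


def build_item(topic: str, draft: Drafter, write_distractors: DistractorWriter,
               verify: Verifier, repair: Repairer) -> Optional[Item]:
    """Run both stages, each independently checked by the verifier."""
    item = verify_with_one_repair(draft(topic), "draft", verify, repair)
    if item is None:
        return None
    item.distractors = write_distractors(item)
    return verify_with_one_repair(item, "distractors", verify, repair)


# Toy stand-ins for the LLM roles (assumptions, for demonstration only).
def toy_draft(topic: str) -> Item:
    return Item(question=f"Define {topic}.", answer="")  # deliberately incomplete


def toy_distractors(item: Item) -> List[str]:
    return ["a", "b"]  # deliberately one short


def toy_verify(item: Item, stage: str) -> Optional[str]:
    if stage == "draft" and not item.answer:
        return "missing answer"
    if stage == "distractors" and len(item.distractors) < 3:
        return "need three distractors"
    return None


def toy_repair(item: Item, stage: str, complaint: str) -> Item:
    if stage == "draft":
        return Item(item.question, "reference answer", item.distractors)
    return Item(item.question, item.answer, ["a", "b", "c"])


if __name__ == "__main__":
    print(build_item("reserving", toy_draft, toy_distractors, toy_verify, toy_repair))
```

In this toy run both stages fail their first check and pass after a single repair. Swapping in a repair function that returns the item unchanged makes `build_item` return `None`, which is the "bounded" part of the loop: a flagged item gets one repair attempt per stage and is then dropped.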
