Evaluating AI Meeting Summaries with a Reusable Cross-Domain Pipeline

Philip Zhong; Don Wang; Jason Zhang

arXiv:2604.21345 · cs.AI · May 14, 2026

TL;DR

This paper introduces a reusable, privacy-aware evaluation system for AI meeting summaries and uses it to benchmark three models (gpt-4.1-mini, gpt-5-mini, and gpt-5.1) on 114 meetings from three domains.

Contribution

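The paper presents an evaluation pipeline that combines structured ground-truth (GT) construction, fixed candidate generation, claim-grounded scoring, persisted reporting, and privacy-bounded online monitoring, so that the same protocol can be reused across domains and used for detailed model comparison.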

Findings

01  GPT-4.1-mini has the highest mean accuracy (0.583).

02  GPT-5.1 leads in retention, completeness, and coverage.

03  Accuracy differences are not statistically significant across models.

Abstract

Industrial teams often deploy large language model features before stable regression or model selection evaluation exists. We present a reusable evaluation system for AI meeting summaries that combines structured ground-truth (GT) construction, fixed candidate generation, claim-grounded scoring, persisted reporting, and a privacy-bounded online monitoring and nomination interface. The online evidence is not itself a benchmark: privacy-safe aggregate exports show active monitoring, hard regime detection, and directional movement without exposing customer data. We benchmark the offline path on 114 meetings across city_council, private_data, and whitehouse_press_briefings, yielding 340 completed meeting-model pairs and 680 judge runs for gpt-4.1-mini, gpt-5-mini, and gpt-5.1. Under this fixed protocol, accuracy differences are not statistically significant under Holm correction (corrected…
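
The Holm correction named in the abstract can be illustrated with a short sketch. This is the standard Holm step-down adjustment, not the authors' code: the function name `holm_adjust` is hypothetical, and the three p-values are made-up placeholders for three pairwise model comparisons, not results from the paper.

```python
from typing import List


def holm_adjust(p_values: List[float]) -> List[float]:
    """Return Holm step-down adjusted p-values in the original order."""
    m = len(p_values)
    # Visit hypotheses from the smallest raw p-value to the largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The p-value at 0-based rank `rank` is multiplied by (m - rank),
        # then capped at 1.0.
        scaled = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity: an adjusted value is never smaller than
        # the adjusted value of any hypothesis with a smaller raw p-value.
        running_max = max(running_max, scaled)
        adjusted[idx] = running_max
    return adjusted


# Hypothetical raw p-values for three pairwise comparisons
# (illustrative only; these are NOT numbers from the paper).
raw = [0.01, 0.04, 0.03]
alpha = 0.05
for p_raw, p_adj in zip(raw, holm_adjust(raw)):
    verdict = "significant" if p_adj < alpha else "not significant"
    print(f"raw={p_raw:.3f} adjusted={p_adj:.3f} -> {verdict}")
```

In this made-up example, two comparisons that would pass an uncorrected 0.05 threshold no longer pass after adjustment, which is the kind of effect a statement such as "not statistically significant under Holm correction" refers to.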
