Evaluating AI Meeting Summaries with a Reusable Cross-Domain Pipeline
Philip Zhong, Don Wang, Jason Zhang

TL;DR
This paper introduces a reusable, privacy-aware evaluation system for AI meeting summaries, benchmarking multiple models across diverse domains with detailed reporting and analysis.
Contribution
It presents a novel evaluation pipeline combining structured ground-truth, claim-grounded scoring, and online monitoring, enabling cross-domain reuse and detailed model comparison.
Findings
GPT-4.1-mini has the highest mean accuracy (0.583)
GPT-5.1 leads in retention, completeness, and coverage
Accuracy differences across models are not statistically significant under Holm correction
Abstract
Industrial teams often deploy large language model features before a stable evaluation for regression testing or model selection exists. We present a reusable evaluation system for AI meeting summaries that combines structured ground-truth (GT) construction, fixed candidate generation, claim-grounded scoring, persisted reporting, and a privacy-bounded online monitoring and nomination interface. The online evidence is not itself a benchmark: privacy-safe aggregate exports show active monitoring, hard regime detection, and directional movement without exposing customer data. We benchmark the offline path on 114 meetings across city_council, private_data, and whitehouse_press_briefings, yielding 340 completed meeting-model pairs and 680 judge runs for gpt-4.1-mini, gpt-5-mini, and gpt-5.1. Under this fixed protocol, accuracy differences are not statistically significant under Holm correction (corrected…
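For readers unfamiliar with the Holm correction mentioned in the abstract, the following is a minimal sketch of the standard Holm-Bonferroni step-down adjustment applied to a set of pairwise comparisons. The function name `holm_adjust` and the p-values are hypothetical and chosen for illustration only; they are not taken from the paper, and the paper's own statistical test and implementation are not described in this excerpt.

```python
def holm_adjust(p_values):
    """Return Holm-Bonferroni adjusted p-values, in the input order.

    Step-down procedure: sort p-values ascending, multiply the k-th
    smallest (0-indexed rank k) by (m - k), cap at 1.0, and enforce
    that adjusted values never decrease along the sorted order.
    """
    m = len(p_values)
    # Indices of the p-values from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Hypothetical raw p-values for three pairwise model comparisons
# (illustrative numbers only, not results from the paper).
raw = [0.01, 0.04, 0.03]
corrected = holm_adjust(raw)

# A comparison counts as significant at level alpha only if its
# corrected p-value falls below alpha.
alpha = 0.05
significant = [p < alpha for p in corrected]
```

With three models there are three pairwise comparisons, so the smallest raw p-value is multiplied by 3, the next by 2, and the largest by 1. A comparison that looks significant on its raw p-value can lose significance after correction, which is the kind of situation the abstract's "not statistically significant under Holm correction" statement refers to.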
