Evaluating Semantic and Syntactic Understanding in Large Language Models for Payroll Systems
Hendrika Maclean, Mert Can Cakmak, Muzakkiruddin Ahmed Mohammed, Shames Al Mandalawi, John Talburt

TL;DR
This paper assesses large language models' ability to understand and accurately perform payroll calculations, highlighting their strengths and limitations in high-stakes, precise tasks.
Contribution
It provides a systematic evaluation framework for LLMs on payroll tasks, revealing when careful prompting suffices and when explicit computation is necessary.
Findings
Models perform well with careful prompting on simple tasks
Explicit computation is needed for complex, high-accuracy requirements
The study offers practical guidance for deploying LLMs in sensitive domains
Abstract
Large language models are now used daily for writing, search, and analysis, and their natural language understanding continues to improve. However, they remain unreliable on exact numerical calculation and on producing outputs that are straightforward to audit. We study a synthetic payroll system as a focused, high-stakes example and evaluate whether models can understand a payroll schema, apply rules in the right order, and deliver cent-accurate results. Our experiments span a tiered dataset from basic to complex cases, a spectrum of prompts from minimal baselines to schema-guided and reasoning variants, and multiple model families including GPT, Claude, Perplexity, Grok and Gemini. Results indicate clear regimes where careful prompting is sufficient and regimes where explicit computation is required. The work offers a compact, reproducible framework and practical guidance for deploying…
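To make "apply rules in the right order" and "cent-accurate" concrete, here is a minimal Python sketch of the kind of explicit computation the abstract contrasts with prompting. The rule order (gross pay, then a pre-tax deduction, then a flat tax), the rounding mode, and every function and parameter name are assumptions for illustration only; they are not taken from the paper's schema or dataset.

```python
from decimal import Decimal, InvalidOperation, ROUND_HALF_UP

CENT = Decimal("0.01")


def to_cents(amount: Decimal) -> Decimal:
    """Round to two decimal places (half-up is an assumed convention)."""
    return amount.quantize(CENT, rounding=ROUND_HALF_UP)


def net_pay(hours: str, rate: str, pretax_deduction: str, tax_rate: str) -> Decimal:
    """Hypothetical rule order: gross -> pre-tax deduction -> tax -> net."""
    gross = to_cents(Decimal(hours) * Decimal(rate))
    taxable = gross - Decimal(pretax_deduction)
    tax = to_cents(taxable * Decimal(tax_rate))
    return to_cents(taxable - tax)


def net_pay_wrong_order(hours: str, rate: str, pretax_deduction: str, tax_rate: str) -> Decimal:
    """Same inputs, but tax is applied before the deduction (a rule-order error)."""
    gross = to_cents(Decimal(hours) * Decimal(rate))
    tax = to_cents(gross * Decimal(tax_rate))
    return to_cents(gross - tax - Decimal(pretax_deduction))


def is_cent_accurate(model_answer: str, expected: Decimal) -> bool:
    """Exact match after stripping currency formatting; no tolerance is allowed."""
    cleaned = model_answer.replace("$", "").replace(",", "").strip()
    try:
        return Decimal(cleaned) == expected
    except InvalidOperation:
        return False


expected = net_pay("37.5", "23.45", "50.00", "0.2")
print(expected)                                 # 663.50
print(is_cent_accurate("$663.50", expected))    # True
print(is_cent_accurate("663.49", expected))     # False
```

Using `Decimal` rather than binary floats keeps intermediate rounding exact, and comparing without a tolerance means a one-cent deviation counts as a failure. In this sketch, applying the tax before the deduction yields 653.50 instead of 663.50, which shows how a rule-order mistake produces a plausible but wrong figure.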
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification
