Analysis of LLM Performance on AWS Bedrock: Receipt-item Categorisation Case Study

Gabby Sanchez; Sneha Oommen; Cassandra T. Britto; Di Wang; Jung-De Chiou; Maria Spichkova

arXiv:2604.01615·cs.AI·April 3, 2026

TL;DR

This study systematically evaluates AWS Bedrock's LLMs for receipt-item categorisation, focusing on accuracy, stability, and cost, and finds Claude 3.7 Sonnet offers the best accuracy-cost trade-off.

Contribution

The paper provides a cost-aware comparison of instruction-tuned LLMs for receipt-item categorisation and identifies which prompting strategies are most efficient.

Findings

01

Claude 3.7 Sonnet balances accuracy and cost effectively

02

Zero-shot prompting is suitable when accuracy must be achieved cost-efficiently

03

Performance varies significantly across models and prompting methods

Abstract

This paper presents a systematic, cost-aware evaluation of large language models (LLMs) for receipt-item categorisation within a production-oriented classification framework. We compare four instruction-tuned models available through AWS Bedrock: Claude 3.7 Sonnet, Claude 4 Sonnet, Mixtral 8x7B Instruct, and Mistral 7B Instruct. The study has two aims: (1) to assess performance in terms of accuracy, response stability, and token-level cost, and (2) to determine which prompting method, zero-shot or few-shot, is more appropriate in terms of both accuracy and incurred cost. Our experiments show that Claude 3.7 Sonnet achieves the most favourable balance between classification accuracy and cost efficiency.
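To make the evaluation dimensions named in the abstract concrete, the following is a minimal Python sketch of how zero-shot and few-shot prompts differ and how accuracy, response stability, and token-level cost might be computed. It is not the authors' implementation: the category list, the example items, the four-characters-per-token heuristic, the `Price` values, and all function names are assumptions made for illustration, and no Bedrock call is made.

```python
from collections import Counter
from dataclasses import dataclass

# Hypothetical label set and few-shot examples; the paper's actual taxonomy is not given here.
CATEGORIES = ["groceries", "household", "personal care", "other"]
FEW_SHOT_EXAMPLES = [("WHL MILK 2L", "groceries"), ("DISH SOAP 500ML", "household")]


def build_prompt(item, examples=()):
    """Zero-shot when `examples` is empty, few-shot otherwise."""
    lines = [
        "Classify the receipt item into exactly one category.",
        "Categories: " + ", ".join(CATEGORIES),
    ]
    for text, label in examples:
        lines.append(f"Item: {text}\nCategory: {label}")
    lines.append(f"Item: {item}\nCategory:")
    return "\n".join(lines)


def estimate_tokens(text):
    """Crude assumption: about 4 characters per token. Real tokenizers differ."""
    return max(1, len(text) // 4)


@dataclass
class Price:
    """Placeholder prices per 1,000 tokens, not actual Bedrock pricing."""
    input_per_1k: float
    output_per_1k: float


def request_cost(prompt, output_tokens, price):
    """Token-level cost of one request: input tokens plus output tokens."""
    input_cost = estimate_tokens(prompt) / 1000 * price.input_per_1k
    output_cost = output_tokens / 1000 * price.output_per_1k
    return input_cost + output_cost


def accuracy(predicted, expected):
    """Fraction of items whose predicted category matches the reference label."""
    return sum(p == e for p, e in zip(predicted, expected)) / len(expected)


def stability(repeated_labels):
    """Share of repeated runs on one item that agree with the majority label."""
    majority_count = Counter(repeated_labels).most_common(1)[0][1]
    return majority_count / len(repeated_labels)


if __name__ == "__main__":
    price = Price(input_per_1k=1.0, output_per_1k=2.0)  # placeholder values
    zero_shot = build_prompt("BANANAS 1KG")
    few_shot = build_prompt("BANANAS 1KG", FEW_SHOT_EXAMPLES)
    print("zero-shot cost:", request_cost(zero_shot, 5, price))
    print("few-shot cost:", request_cost(few_shot, 5, price))
```

Under these assumptions a few-shot prompt always costs more per request than its zero-shot counterpart, because the worked examples add input tokens; whether the extra cost buys enough accuracy is the trade-off the study examines.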
