Bench360: Benchmarking Local LLM Inference from 360 Degrees
Linus Stuhlmann, Mauricio Fadel Argerich, Jonathan Fürst

TL;DR
Bench360 is a comprehensive benchmarking framework for evaluating local large language model inference across diverse tasks, system metrics, and configurations, aiding deployment decisions.
Contribution
It introduces a unified platform supporting multiple inference engines, quantization formats, and custom tasks, filling gaps left by fragmented existing benchmarks.
Findings
Tradeoffs between efficiency and quality are significant.
Configuration choices depend on specific workloads and constraints.
No universal best configuration exists for local LLM inference.
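One way to read these findings is through Pareto dominance: when several configurations each win on a different axis, none of them is the single best choice. The sketch below is illustrative only. The `Config` type, the two axes (quality and latency), and all numbers are assumptions made up for this example; they are not Bench360's data model or results from the paper.

```python
from typing import NamedTuple


class Config(NamedTuple):
    name: str          # hypothetical configuration label
    quality: float     # task quality score, higher is better
    latency_s: float   # mean latency in seconds, lower is better


def dominates(a: Config, b: Config) -> bool:
    """True if a is no worse than b on both axes and strictly better on one."""
    no_worse = a.quality >= b.quality and a.latency_s <= b.latency_s
    strictly_better = a.quality > b.quality or a.latency_s < b.latency_s
    return no_worse and strictly_better


def pareto_front(configs: list[Config]) -> list[Config]:
    """Keep the configurations that no other configuration dominates."""
    return [c for c in configs if not any(dominates(o, c) for o in configs)]


# Made-up numbers for illustration only; not results from the paper.
configs = [
    Config("A", quality=0.80, latency_s=2.0),
    Config("B", quality=0.70, latency_s=1.0),
    Config("C", quality=0.65, latency_s=1.5),
]
front = pareto_front(configs)
print([c.name for c in front])
```

Here C is dominated by B, while A and B both stay on the front: A is more accurate and B is faster, so the choice between them depends on the workload and constraints.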
Abstract
Running LLMs locally has become increasingly common, but users face a complex design space across models, quantization levels, inference engines, and serving scenarios. Existing inference benchmarks are fragmented and focus on isolated goals, offering little guidance for practical deployments. We present Bench360, a framework for evaluating local LLM inference across tasks, usage patterns, and system metrics in one place. Bench360 supports custom tasks, integrates multiple inference engines and quantization formats, and reports both task quality and system behavior (latency, throughput, energy, startup time). We demonstrate it on four NLP tasks across three GPUs and four engines, showing how design choices shape efficiency and output quality. Results confirm that tradeoffs are substantial and configuration choices depend on specific workloads and constraints. There is no universal best…
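The system metrics named in the abstract can be derived from per-request timing records. The following is a minimal sketch under stated assumptions, not Bench360's implementation: the `RequestRecord` type, the `summarize` function, and the metric definitions (mean end-to-end latency, output tokens per wall-clock second, joules per output token) are hypothetical choices for illustration, and the energy figure is assumed to come from an external power measurement.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RequestRecord:
    start_s: float      # request submission time, in seconds
    end_s: float        # time the last token was received, in seconds
    output_tokens: int  # number of generated tokens


def summarize(records: list[RequestRecord], energy_joules: float) -> dict[str, float]:
    """Aggregate per-request timings into run-level metrics (assumed definitions)."""
    latencies = [r.end_s - r.start_s for r in records]
    # Wall-clock span of the run, so overlapping requests are not double counted.
    wall_s = max(r.end_s for r in records) - min(r.start_s for r in records)
    total_tokens = sum(r.output_tokens for r in records)
    return {
        "mean_latency_s": mean(latencies),
        "throughput_tok_per_s": total_tokens / wall_s,
        "energy_j_per_token": energy_joules / total_tokens,
    }


# Two overlapping requests with made-up timings and a made-up energy reading.
records = [RequestRecord(0.0, 2.0, 100), RequestRecord(1.0, 4.0, 100)]
summary = summarize(records, energy_joules=400.0)
print(summary)
```

Throughput is computed over the wall-clock span rather than the sum of latencies, which matters once requests are served concurrently.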
Taxonomy
Topics: Advanced Neural Network Applications · Software Engineering Research · Scientific Computing and Data Management
