ConsumerBench: Benchmarking Generative AI Applications on End-User Devices
Yile Gu, Rohan Kadekodi, Hoang Nguyen, Keisuke Kamahori, Yiyu Liu, Baris Kasikci

TL;DR
ConsumerBench is a benchmarking framework that evaluates the efficiency and response times of Generative AI applications on end-user devices, addressing resource sharing and scheduling challenges in constrained hardware environments.
Contribution
It introduces a comprehensive, realistic benchmarking framework for GenAI on end-user devices, including multi-application scenarios and customizable workflows.
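One of the reviews on this page describes the customizable workflows as using a DAG-based task execution model. As a rough illustration only, the sketch below groups a small workflow into waves of tasks that could run concurrently once their dependencies finish. The task names and the `execution_waves` helper are hypothetical and are not taken from ConsumerBench itself; the ordering relies on Python's standard `graphlib` module.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each task maps to the set of tasks it depends on.
workflow = {
    "transcribe_audio": set(),
    "summarize_text": {"transcribe_audio"},
    "generate_image": {"summarize_text"},
    "draft_reply": {"summarize_text"},
}

def execution_waves(dag):
    """Group tasks into waves; every task in a wave has all dependencies met."""
    sorter = TopologicalSorter(dag)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        waves.append(ready)
        sorter.done(*ready)  # mark the wave finished to unlock dependents
    return waves

print(execution_waves(workflow))
```

In this toy workflow the last wave holds two tasks, which is the situation where applications would compete for the same constrained hardware.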
Findings
Identifies resource sharing inefficiencies and unfair scheduling.
Highlights performance issues with static model server configurations.
Recommends custom kernels and SLO-aware scheduling for improved performance.
Abstract
The recent shift in Generative AI (GenAI) applications from cloud-only environments to end-user devices introduces new challenges in resource management, system efficiency, and user experience. This paper presents ConsumerBench, a comprehensive benchmarking framework designed to evaluate the system efficiency and response time of GenAI models running on end-user devices. Unlike existing benchmarks that assume exclusive model access on dedicated GPUs, ConsumerBench simulates realistic multi-application scenarios executing concurrently on constrained hardware. Furthermore, ConsumerBench supports customizable workflows that simulate complex tasks requiring coordination among multiple applications. ConsumerBench captures both application-level metrics, including latency and Service Level Objective (SLO) attainment, and system-level metrics like CPU/GPU utilization and memory bandwidth.…
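To make the SLO attainment metric concrete, here is a minimal sketch that treats attainment as the fraction of an application's requests whose measured latency falls within its latency target. That definition, the `Request` record, and the sample numbers are assumptions for illustration; the abstract does not spell out how ConsumerBench computes the metric.

```python
from dataclasses import dataclass

@dataclass
class Request:
    app: str          # hypothetical application name
    latency_s: float  # measured end-to-end latency, in seconds
    slo_s: float      # latency target for this request, in seconds

def slo_attainment(requests):
    """Return, per application, the fraction of requests that met their SLO."""
    met, total = {}, {}
    for r in requests:
        total[r.app] = total.get(r.app, 0) + 1
        met[r.app] = met.get(r.app, 0) + (1 if r.latency_s <= r.slo_s else 0)
    return {app: met[app] / total[app] for app in total}

# Made-up sample data, not results from the paper.
sample = [
    Request("chatbot", 0.8, 1.0),
    Request("chatbot", 1.4, 1.0),
    Request("image_gen", 4.0, 5.0),
]
print(slo_attainment(sample))
```

Reporting attainment per application rather than as one global number keeps unfair scheduling visible, since a starved application cannot hide behind a well-served one.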
Peer Reviews
Decision: Submitted to ICLR 2026
Benchmarking hardware is an important problem. This paper aims to improve the current state of the art in edge/small-device benchmarking by benchmarking workflows of tasks.
My main issue is with the novelty and depth of the work. For example, as a benchmark, when comparing the suggested benchmark with PalmBench (cited in the paper), the authors have a much more limited set of applications/models. When it comes to insights from the experiments, the results are very well known. For example, the KTransformers project (open-sourced with a paper in SOSP 2025) was set up to address many of the insights discussed. Another issue, for a benchmarking paper, one typically
The paper identifies a useful, practical gap in benchmarking applications under constrained on-device resources. It also presents a carefully thought-out workflow and analysis, along with an example use case where such a benchmark/framework might bring more insight into concurrent execution and where its limits might occur. The findings are well written and presented with clarity.
The paper provides a benchmark that is useful in terms of software aspects for applications running concurrently on constrained resources. However, it does not mention any consideration of hardware impacts, which can further influence the performance of applications on end-user devices. Furthermore, the paper could explain more about the use cases and usefulness of the benchmark, such as how the workflow/findings scale and provide systematic insights across different architectures and types
1. Novel concurrency focus: The paper addresses multi-application inference, a relatively unexplored yet practically important problem for end-user AI systems. 2. The experimental results yield clear, practical takeaways for developers seeking to improve performance and fairness when multiple generative AI applications share hardware resources. 3. The use of YAML-based configuration and the DAG-based task execution model make the framework easily extensible — new applications, models, or metrics
1. Despite claiming a focus on “end-user devices,” all experiments are performed on a single workstation with an RTX 6000 GPU. Evaluations on consumer-class GPUs (e.g., RTX 4060/4070) or integrated accelerators would strengthen the paper’s external validity. 2. Each task uses a fixed model configuration. Demonstrating results across multiple models per modality would better validate that the benchmark’s findings generalize beyond specific architectures. 3. The paper discusses SLO-aware resource
Taxonomy
Topics: IoT and Edge/Fog Computing · Big Data and Digital Economy · Cloud Computing and Resource Management
