Speed and Conversational Large Language Models: Not All Is About Tokens per Second

Javier Conde; Miguel González; Pedro Reviriego; Zhen Gao; Shanshan Liu; Fabrizio Lombardi

arXiv:2502.16721·cs.CL·February 25, 2025

TL;DR

This paper analyzes the actual speed of open-weight LLMs on GPUs, highlighting that speed is influenced by task specifics and not solely by token processing rates.

Contribution

It provides a comparative analysis of open LLMs' speed on GPUs, emphasizing task-dependent performance factors beyond token throughput.

Findings

01

Speed varies significantly with task type and model architecture.

02

Tokens-per-second metrics do not fully capture real-world performance.

03

Open LLMs exhibit diverse speed profiles depending on workload.
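
To make the findings above concrete, here is a minimal sketch of why a tokens-per-second figure alone may not predict how long a task takes. It assumes, purely for illustration, that two models need different numbers of output tokens to answer the same request and that prompt processing time is negligible. The function name, model names, and all numbers are hypothetical and are not measurements from the paper.

```python
def task_time_seconds(output_tokens: int, tokens_per_second: float) -> float:
    """Time to generate a full answer, ignoring prompt processing (a simplifying assumption)."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return output_tokens / tokens_per_second


# Hypothetical models with illustrative numbers only (not taken from the paper).
model_a = {"tokens_per_second": 50.0, "output_tokens": 400}  # higher throughput, longer answer
model_b = {"tokens_per_second": 40.0, "output_tokens": 200}  # lower throughput, shorter answer

time_a = task_time_seconds(model_a["output_tokens"], model_a["tokens_per_second"])
time_b = task_time_seconds(model_b["output_tokens"], model_b["tokens_per_second"])

print(f"Model A: {time_a:.1f} s, Model B: {time_b:.1f} s")
```

Under these assumed numbers, the model with the lower token rate finishes the task sooner, which is the kind of effect a per-task comparison would expose and a raw throughput figure would hide.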

Abstract

This paper studies the speed of open-weight large language models (LLMs) running on GPUs and how that speed depends on the task at hand, presenting a comparative analysis of the most popular open LLMs.
