Benchmarking Local Language Models for Social Robots using Edge Devices
Dorian Lamouille, Matevž B. Zorec, Farnaz Baksh, Karl Kruusamäe

TL;DR
This study systematically benchmarks 25 open-source language models on edge devices for social robots, evaluating efficiency, knowledge, and teaching effectiveness to inform deployment strategies.
Contribution
It provides a comprehensive comparison of models for pedagogical social robots, highlighting trade-offs and proposing a three-tier inference architecture for resource-limited hardware.
Findings
Granite4 Tiny Hybrid (7B) balances speed, energy, and accuracy.
MMLU accuracy ranges from near-random to 57.2%.
Teaching effectiveness does not correlate directly with efficiency or knowledge.
Abstract
Social-educational robots designed for socially interactive pedagogical support, such as the Robot Study Companion (RSC), rely on responsive, privacy-preserving interaction despite severely limited compute. However, there is a gap in systematic benchmarking of language models for edge computing in pedagogical applications. This paper benchmarks 25 open-source language models for local deployment on edge hardware. We evaluate each model across three dimensions: inference efficiency (tokens per second, energy consumption), general knowledge (a six-category MMLU subset), and teaching effectiveness (LLM-rated pedagogical quality, validated against five independent human raters). The Raspberry Pi (RPi) 4 serves as the primary platform, with additional comparisons on the RPi 5 and a laptop GPU. Results reveal pronounced trade-offs: throughput and energy efficiency vary by over an order of…
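The efficiency and knowledge metrics named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' benchmarking code: the `RunMeasurement` fields, the function names, and the idea of estimating energy as mean power times wall-clock time are all assumptions made for illustration, and the numbers in the usage example are invented placeholders rather than results from the paper.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class RunMeasurement:
    """One generation run. All fields are hypothetical measured values."""
    tokens_generated: int   # number of output tokens produced
    wall_time_s: float      # wall-clock generation time in seconds
    mean_power_w: float     # average power draw during the run, in watts


def tokens_per_second(run: RunMeasurement) -> float:
    """Throughput: generated tokens divided by elapsed time."""
    return run.tokens_generated / run.wall_time_s


def joules_per_token(run: RunMeasurement) -> float:
    """Energy per token, assuming energy = mean power * elapsed time."""
    return run.mean_power_w * run.wall_time_s / run.tokens_generated


def multiple_choice_accuracy(predictions: Sequence[str],
                             answers: Sequence[str]) -> float:
    """Fraction of multiple-choice questions answered correctly."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have equal length")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


if __name__ == "__main__":
    # Placeholder values, not measurements from the paper.
    run = RunMeasurement(tokens_generated=120, wall_time_s=60.0, mean_power_w=5.0)
    print(tokens_per_second(run))
    print(joules_per_token(run))
    print(multiple_choice_accuracy(["A", "B", "C", "D"], ["A", "B", "D", "D"]))
```

Reporting energy per token rather than total energy makes models with different response lengths comparable, which is one plausible way to read the "energy consumption" dimension; the paper itself may define it differently.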
